Executive Summary


This report describes the activities performed and the progress made in the development of
HARC (Hear And Respond to Continuous speech), BBN’s spoken language system, from
May 1, 1989 to February 29, 1992. Significant progress has been made in both the speed
and the accuracy of understanding. New search algorithms that we
developed during this period have increased the speed of HARC by more than three orders
of magnitude. The result has been the first spoken language system that runs in real
time on an off-the-shelf workstation, with no additional hardware. Furthermore, real-time
performance was achieved without losing understanding accuracy. In tests performed by the
National Institute of Standards and Technology (NIST) on data collected from the DARPA
Airline Travel Information System (ATIS) domain, the BBN HARC system had the best
speech recognition and speech understanding performance. Another important milestone
in this project has been the demonstration of HARC in a military logistical transportation
planning application.

Below  is  a  list  of  some  of  the  major  accomplishments  of  this  project: 


• Extended the coverage and robustness of the syntactic and semantic components
of our natural language system, DELPHI, already developed before the start of the
project.

• Developed a discourse component based on our previous work with natural language
systems, to permit the use of DELPHI in an interactive spoken language system.

• Developed speech processing techniques to detect out-of-vocabulary words in users’
utterances and to add them to the system.

• Developed a new paradigm, N-best, for the integration of knowledge sources (both
in speech recognition and natural language understanding). The N-best paradigm
has become the de facto standard for integrating knowledge sources in the DARPA
community and elsewhere.

• Integrated DELPHI with our speech recognition system, BYBLOS, to form a complete
spoken language system, HARC.

• Increased the speed of all components to develop a real-time spoken language system.
The system runs in real time on an off-the-shelf workstation, with no additional
hardware.

•  Ported  HARC  to  a  military  logistical  transportation  planning  domain. 

•  Created  a  videotape  to  illustrate  these  capabilities  and  some  potential  applications  of 
spoken  language  technology. 

•  Participated  in  the  DARPA  cross-site  data  collection  effort  and  delivered  the  stipulated 
speech  data  to  NIST  on  time. 

•  Participated  in  all  evaluation  tests. 


We  now  give  a  brief  description  of  these  highlights. 

During this project we modified our natural language system, DELPHI, to deal with
the range of utterances that might be used as the input to a spoken language system.
Towards this end, we extended the coverage of the grammar and also developed alternative
processing strategies to allow DELPHI to determine the meaning even of utterances that the
system could not completely parse. We also developed a domain-independent discourse
component and a domain-dependent plan-tracking component for the system, to allow
DELPHI to be used in the natural dialogue situations that a spoken language system might
be expected to encounter.

We continued our work in developing new speech recognition algorithms. In particular,
we developed techniques that allow BYBLOS to recognize out-of-vocabulary words and to
add them to the system’s dictionary so that the system can recognize them in future usage.
This capability was demonstrated in the DARPA Resource Management task domain.

One of the major developments has been the introduction of the N-best paradigm for
integrating knowledge sources in speech recognition and understanding. The idea is that an
initial speech recognition pass produces not only the top scoring sentence, but also the top
N (or N-best) sentences, where N is typically less than 100. The N-best list is then rescored
with other, more expensive knowledge sources, and the list is reordered. For example, we
integrate statistical trigram grammars in this way, as well as cross-word rescoring. We
also use the N-best paradigm to integrate our speech recognition and natural language
components. The major benefits of the N-best paradigm are significant increases in speed
of processing and improvement in overall recognition and understanding performance. The
N-best algorithm, which was first presented in October 1989, has become the community-wide
de facto standard for integration of knowledge sources, especially the integration of
speech and natural language.
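
The rescoring step of the N-best paradigm can be sketched in a few lines. This is a minimal illustration only, not the BYBLOS implementation: the function names, the additive combination of log scores, the `weight` parameter, and the toy knowledge source are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# (sentence, log score from the first recognition pass)
Hypothesis = Tuple[str, float]


def rescore_nbest(
    nbest: List[Hypothesis],
    extra_score: Callable[[str], float],
    weight: float = 1.0,
) -> List[Hypothesis]:
    """Add a second knowledge source's log score to each hypothesis and reorder."""
    rescored = [(sent, score + weight * extra_score(sent)) for sent, score in nbest]
    # Highest combined score first; sorted() is stable, so ties keep first-pass order.
    return sorted(rescored, key=lambda hyp: hyp[1], reverse=True)


def toy_knowledge_source(sentence: str) -> float:
    """Hypothetical stand-in for a more expensive model (e.g. a trigram grammar)."""
    words = sentence.split()
    # Penalize the unlikely word pair "two boston"; all other sentences score 0.
    return -3.0 if ("two", "boston") in zip(words, words[1:]) else 0.0


first_pass = [
    ("show me flights two boston", -10.0),
    ("show me flights to boston", -10.5),
]
reordered = rescore_nbest(first_pass, toy_knowledge_source)
```

Because the expensive knowledge source is applied to at most N sentences rather than inside the recognition search, its cost is bounded by the length of the list.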

The integration strategy for our speech recognition and natural language components
adopted for HARC, our spoken language system, is the following. The speech recognition
component produces a list of N-best hypotheses, which are passed to the natural language
processor. (For this part of our system, we have found a value of N=5 to give best
understanding results.) The natural language system processes these hypotheses until a
hypothesis which produces an answer (a database retrieval) is encountered or until the list
of hypotheses is exhausted. The complete HARC system has been demonstrated in the
ATIS and DART domains (see below).
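
The control flow of this integration strategy can be sketched as follows. The sketch assumes a hypothetical `interpret` callable that stands in for the natural language processor and database, returning `None` when a hypothesis yields no answer; none of these names come from HARC itself.

```python
from typing import Callable, Optional, Sequence, Tuple


def first_answerable(
    nbest: Sequence[str],
    interpret: Callable[[str], Optional[str]],
    n: int = 5,
) -> Optional[Tuple[int, str, str]]:
    """Return (rank, sentence, answer) for the first hypothesis that yields an answer."""
    for rank, sentence in enumerate(nbest[:n]):
        answer = interpret(sentence)
        if answer is not None:
            return rank, sentence, answer
    return None  # list exhausted without a database retrieval


# Hypothetical stand-in for the natural language system plus the database.
TOY_ANSWERS = {"show flights from boston to denver": "3 flights found"}


def toy_interpret(sentence: str) -> Optional[str]:
    return TOY_ANSWERS.get(sentence)


hypotheses = [
    "show flights from boston two denver",
    "show flights from boston to denver",
]
result = first_answerable(hypotheses, toy_interpret)
```

Here the top-ranked hypothesis produces no answer, so the second one is selected.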

We  participated  in  all  the  evaluations  sponsored  by  DARPA  and  performed  by  NIST. 
In the most recent (February 1992) evaluation in the ATIS domain, our BYBLOS system
had  the  best  speech  recognition  performance  and  our  HARC  system  had  the  best  speech 
understanding  performance  of  all  the  sites  participating  in  the  evaluation. 

A  major  milestone  during  the  course  of  the  project  was  the  demonstration  of  our  spoken 
language  system  HARC  in  a  military  logistical  planning  application:  DART  (Dynamic 
Analysis  Replanning  Tool).  In  this  demonstration,  speech  was  used  to  replace  large  numbers 
of computer mouse clicks to retrieve information from a database of logistics information.
A  videotape  demonstrating  this  application  of  spoken  language  technology  was  made  and 
was  shown  at  the  1991  DARPA  Speech  and  Natural  Language  Workshop,  as  well  as  to 
interested  government  personnel. 

The basic evaluation methodology used in all the cross-site evaluations involving natural
language was proposed and initially developed by BBN. Continuing the tradition
of objective evaluations begun for speech recognition systems, we proposed an evaluation
technique in which systems utilizing a natural language “understanding” component,
whether with text or speech input, would be evaluated in terms of the output answers they
produced from a fixed, communally available database. We created the CAS (Common
Answer Specification) format used for such answers and created the first software program
for comparing hypothesis and reference answers. We participated in the development of
evaluation strategies for pairs of related sentences, and, ultimately, for evaluating “full
sessions”.

As part of the cross-site SLS data collection effort that began in June of 1991, we
constructed a “Wizard” data collection system that incorporated DELPHI and BYBLOS.
We collected over 2200 sentences in all, fulfilling the SLS community’s agreement to
collect this amount of data by the end of August 1991. (We were the only site to actually
meet this deadline.)

The crowning achievement of this project was the development of a complete spoken
language system working in real time on an off-the-shelf workstation. New search algorithms
that resulted in more than three orders of magnitude increase in computing speed of
our combined HARC system allowed this accomplishment to take place. The system can
perform real-time speech recognition of continuous speech with a 1000-word vocabulary,
give an N-best list as well, rescore with a trigram grammar for higher accuracy, and perform
natural language understanding as well, all completely in software. The system has been
demonstrated on the Indigo workstation from Silicon Graphics Inc. and the SUN SparcStation
2. The real-time system uses the A/D internal to the workstation for speech input. Being a
software-only system, our real-time system can be ported easily to other applications and
other platforms.

Ours is currently the only system in the DARPA community that can run in real time
in  the  ATIS  domain  and  is  also  the  system  with  the  best  understanding  performance. 


Report  Outline 


The structure of the rest of this report is as follows. Chapter 1 provides a general
overview of the current architecture of HARC and details the changes from the version
described in [24]. Chapter 2 describes the grammar formalism, concentrating on the aspects
of our approach to syntax that provide for maximum flexibility and coverage. Chapter 3
describes the current semantic processing framework, and the way in which the syntactic
phenomena described in Chapter 2 are interpreted by the semantic component. Chapter
4 describes the parsing and fallback strategies which we have adopted to increase speed
and robustness of syntactic processing. Chapter 5 describes the discourse and task-tracking
components we implemented during this project. Chapter 6 describes the N-best algorithm
and its use in the integration strategy for speech and natural language that we have now
adopted. Chapter 7 describes our development of a real-time speech recognition system.
Chapter 8 describes our development of techniques to detect new words in speech and to
allow the user to add them to the system. Chapter 9 describes our co-operative efforts
with other research sites. Chapter 10 describes our application of spoken language system
technology to a military domain. Chapter 11 describes our lead role in producing an evaluation
methodology for natural language and spoken language systems and our participation
in the common data collection effort. Chapter 12 gives a list of the papers published by
the members of the project over the course of the project, as well as oral presentations
and conference submissions that have been accepted for presentation. The Bibliography
contains references pertinent to this work.


Chapter  1 


System  Overview 


This chapter provides a general overview of HARC, the BBN Spoken Language System.
The general architecture of HARC is shown in Figure 1.1. The user presents a spoken
utterance to BYBLOS, our continuous speech recognition system, which produces a list
of N-best possible hypotheses as to the sentence spoken. This N-best list is passed to
DELPHI, our natural language understanding system, which processes the N-best list until it
finds a hypothesis that produces a database retrieval command. This is then passed to the
application database, producing an answer that is displayed to the user.



Figure  1.1:  The  Architecture  of  HARC 

This chapter describes, in broad terms, the improvements to DELPHI, our natural language
processing system (Section 1.1), BYBLOS, our speech recognition system (Section
1.2), and the integration of these two components into a spoken language system (Section
1.3), from the version of the system described in [24] [25]. The chapter concludes with a
summary of implementation specifications (Section 1.4).


1.1  DELPHI  (Natural  Language  Component) 


During the predecessor to this contract, we produced the initial version of the DELPHI
natural language understanding system. During the course of that project, we designed
and implemented a grammar formalism, adopted an all-paths parsing algorithm, and implemented
a Montague-style [65] semantics, in which each rule of the grammar had an
associated semantic interpretation rule. Owing to the fairly simple nature of the application
domain, that system had a purely sentential semantics, with no discourse capabilities.

During the course of this project, we have made a number of changes to DELPHI:


• We have reworked the grammar rules and grammar formalism to produce a grammar
that can be more efficiently parsed and interpreted.

• We have integrated semantics with syntax, to improve system performance and
robustness.

•  We  have  added  a  discourse  component  to  the  system,  to  allow  it  to  work  in  real 
application  domains. 


Our main research goal during this project has been to take the tools currently available
to grammar developers (such as unification, efficient parsing algorithms, etc.) and to
integrate them into the most efficient and robust overall system possible. We have tried
not to rely on any one mechanism too heavily but rather to deploy each where it functions
optimally.

The following sections give a rough overview of our changes to the grammar, semantics,
and parser, and the current implementation of our discourse component.


1.1.1  The  Syntactic  Component:  the  Grammar  Formalism 

The original DELPHI grammar formalism was a relatively “stripped down” version of
the complex-feature based grammar formalisms that are common today; see [87] for an
overview of many such formalisms. It was, in effect, an implementation in LISP of the
common Definite Clause Grammar (DCG) formalism [74] available in Prolog. In particular,
it did not support the use of the relatively sophisticated mechanisms that are part of other
formalisms, such as feature disjunction, feature negation, metarules, optional arguments,
and the use of attribute-value pairs, rather than positional arguments. In general, we have
found this Spartan character to actually be of great use in writing a large grammar and have
mostly maintained it. However, we have introduced a limited use of feature disjunction,
described in Section 4.1.1. In addition to this change, there have been several other major
changes in the grammar proper.

We have reduced the size of many subparts of the grammar by adopting a recursive
structure for various categories. For example, in earlier versions of the grammar, there
were many rules for different expansions of the common noun phrase. The noun phrase
grammar now consists of a CORE-NP constituent which spans the portion of the noun
phrase from the determiner to its head (e.g. “which flights”, “how many airlines”, etc.),
a unit production that expands NP as CORE-NP, and recursive expansions of NP, each
adding a (typically single) constituent that may be interpreted as a complement or adjunct
to the noun phrase, such as relative clauses (“that leave on Sunday”), prepositional phrases
(“from Boston”), participial phrases (“leaving after 6 PM”), etc. We discuss this aspect of
the grammar in Section 2.3.
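
The recursive noun phrase structure described above can be illustrated with a small sketch. It assumes the input has already been segmented into labeled constituents; the modifier labels (`REL-CLAUSE`, `PP`, `PARTICIPIAL`) and the tuple-based tree representation are hypothetical and are not taken from the DELPHI grammar.

```python
from typing import List, Optional, Tuple

Constituent = Tuple[str, str]  # (category label, words)

# Post-head constituents that the recursive NP rules may attach (labels are hypothetical).
MODIFIER_LABELS = {"REL-CLAUSE", "PP", "PARTICIPIAL"}


def build_np(constituents: List[Constituent]) -> Optional[tuple]:
    """Build a left-nested NP tree: NP -> CORE-NP, then NP -> NP MODIFIER repeatedly."""
    if not constituents or constituents[0][0] != "CORE-NP":
        return None
    tree: tuple = ("NP", constituents[0])  # unit production NP -> CORE-NP
    for modifier in constituents[1:]:
        if modifier[0] not in MODIFIER_LABELS:
            return None
        tree = ("NP", tree, modifier)  # recursive expansion NP -> NP MODIFIER
    return tree


np_tree = build_np([
    ("CORE-NP", "which flights"),
    ("PP", "from Boston"),
    ("REL-CLAUSE", "that leave on Sunday"),
])
```

Two rule shapes (one unit production and one recursive expansion per modifier type) cover any number of post-head modifiers, which is why the recursive formulation needs fewer rules than enumerating each combination.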

To aid in efficient parsing, we have split apart various categories that were formerly
unitary. For example, the earlier grammar contained a single S (Sentence) category that
was used for main clauses (“What flights do you have from Boston to Baltimore?”), relative
clauses (“a flight that arrives around eleven o’clock in the morning”), and other subordinate
clauses (“I’d like to know what flights United Airline has from Dallas to San Francisco.”).
However, these types of clauses have very different syntactic properties and treating them
as a single category did not permit the optimal parsing strategy. We have since split S into
three different categories. More details of similar sorts of category division are discussed
in Section 4.1.2.

The earlier DELPHI grammar made sparing use of constraint relations. These are
grammar rules or constituents thereof that do not derive elements in the linguistic input but
rather solve equations constraining the well-formedness of the parse produced. An example
of such a constraint relation from the earlier grammar was P-MIN, which computed the
person of a conjoined noun phrase (for the purposes of subject-verb agreement). In the current
project, we have made extensive use of constraint relations, for a number of purposes,
both syntactic and semantic. These are discussed in more detail in Sections 2.2 and 3.6.
We have also introduced a mechanism to compile constraint relations into executable code,
so that our grammar formalism now contains declarative rules with attached procedures.
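
To make the idea of a constraint relation concrete, the following sketch shows one plausible reading of P-MIN. The text states only that P-MIN computes the person of a conjoined noun phrase; that it does so by taking the numerically smallest person (so that “you and I” is first person) is an assumption based on the relation's name and on English agreement, and the function name and integer encoding are hypothetical.

```python
VALID_PERSONS = (1, 2, 3)  # first, second, third person


def p_min(left_person: int, right_person: int) -> int:
    """Person of a conjoined NP, assumed to be the minimum of the conjuncts' persons."""
    for person in (left_person, right_person):
        if person not in VALID_PERSONS:
            raise ValueError(f"person must be 1, 2 or 3, got {person!r}")
    return min(left_person, right_person)


# "you and I" -> first person; "you and the pilot" -> second person
you_and_i = p_min(2, 1)
you_and_pilot = p_min(2, 3)
```

Such a relation consumes no words from the input; it only relates feature values of the conjuncts to a feature value of the whole phrase.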

Finally, we have reduced the size of the grammar by moving terminal elements like the
articles (“a”, “the”, etc.), prepositions (“in”, “on”, “at”, etc.), and other grammatical
formatives (loosely, form words rather than content words), which were previously introduced
directly in the grammar, into the lexicon.

The current DELPHI grammar contains 453 rules to handle the general syntactic
constructions of English.


1.1.2 The Semantic Component


In the earlier version of the DELPHI system, a single semantic rule was assigned to each
grammar rule. Semantic processing did not take place during parsing. Rather, the parser
would return all parses syntactically compatible with the linguistic input. A post-processor
would then process each parse, traversing the parse tree in a recursive descent and filtering
out all parses that were not semantically coherent. This generate-and-test paradigm, given
the wide range of syntactic analyses compatible with a linguistic input without semantic
constraints, was not optimal. During this project we have worked to establish the best
link-up between syntax and semantics. Our current architecture applies semantic type
constraints during parsing (for example, in the ATIS domain, a Noun Phrase object of the
verb “leave” must be either an airport or a city) and constructs larger semantic formulae
out of the interpretations of syntactic subcomponents only at levels where all the elements
that contribute to that formula are present. For example, in interpreting the utterance “Does
United Airline have any flights from Dallas to San Francisco?” it is only at the level of
the entire sentence that the semantic interpretations of the Noun Phrase “United Airline”,
the Verb “have” and “any flights from Dallas to San Francisco” are composed. Crucially,
there is no partial interpretation of the Verb Phrase “have any flights from Dallas to San
Francisco”, as there would be in most unification systems. The evolution of our approach
to semantic interpretation is described in Chapter 3.
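
A semantic type constraint of the kind applied during parsing can be sketched as a simple table lookup. Only the constraint on “leave” (its object must be an airport or a city) comes from the text; the table layout, the toy lexicon, and the function name are assumptions for illustration and do not reflect DELPHI's actual representation.

```python
from typing import Dict, Set

# Verb -> semantic types permitted for its Noun Phrase object.
OBJECT_TYPE_CONSTRAINTS: Dict[str, Set[str]] = {
    "leave": {"airport", "city"},
}

# Hypothetical toy lexicon mapping head nouns to semantic types.
SEMANTIC_TYPE: Dict[str, str] = {
    "Boston": "city",
    "Logan": "airport",
    "United": "airline",
}


def object_type_ok(verb: str, head_noun: str) -> bool:
    """Return True if the noun's semantic type satisfies the verb's object constraint."""
    allowed = OBJECT_TYPE_CONSTRAINTS.get(verb)
    if allowed is None:
        return True  # no constraint recorded for this verb
    return SEMANTIC_TYPE.get(head_noun) in allowed


leave_boston = object_type_ok("leave", "Boston")
leave_united = object_type_ok("leave", "United")
```

Checking the constraint when the verb and object are combined lets a parser discard an analysis such as “leave United” immediately, instead of producing it and filtering it out in a later pass.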


1.1.3 Parsing


The parsing algorithm originally used to parse the DELPHI grammar was based on the
Graham, Harrison, Ruzzo (GHR) algorithm [39], itself an extension of the familiar
Cocke-Kasami-Younger (CKY) algorithm for context-free grammars. Our algorithm was, in turn,
an extension of the GHR algorithm to support complex feature-based grammars (the original
GHR algorithm only handles true context-free grammars). In our earlier implementation
of the GHR algorithm, we omitted the prediction portion of the algorithm: this is a mechanism
that utilizes the current initial syntactic context to try to predict the (most likely)
following syntactic constituent(s). The reason for the omission is that the notion of syntactic
constituent in a feature-based grammar is quite complex, encompassing not only category
label, but syntactic features as well (for example, at the level of category label, a singular
NP like “the flight” and a plural NP like “flights” are the same type of constituent; however,
if the number feature is included, they are distinct). In earlier work, we discovered that
prediction based purely on category label was too weak to improve parsing performance,
while prediction using all syntactic features was too computationally expensive. However,
the division of categories such as S into multiple categories allows us to use prediction by
category label alone to good effect. This is discussed in more detail in Section 4.1.2.

While this modification to the GHR algorithm increased processing speed, we still found
the processing to be too slow for the requirements of a real-time spoken language
system. Therefore, to overcome the limitations of the GHR algorithm, we created a second
algorithm, which utilizes the same parse tables and chart structures as our implementation
of the GHR algorithm, and which works in a best-first manner, using statistical grammar rule
scores derived from a training corpus. The agenda mechanism we have used to implement
this best-first search is described in Section 4.2. This best-first algorithm is what we
normally use in our current system, but our GHR parser is also available for situations in
which the best-first algorithm is not appropriate.
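
The general shape of an agenda for best-first search can be sketched with a priority queue. This is a generic illustration, not the mechanism of Section 4.2: the `Agenda` class, the use of log-probability-like scores, and the example rule strings are all assumptions.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class Agenda:
    """Priority queue that always returns the highest-scoring pending item."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._tie_breaker = itertools.count()  # keeps insertion order on equal scores

    def push(self, score: float, item: Any) -> None:
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, next(self._tie_breaker), item))

    def pop(self) -> Tuple[float, Any]:
        neg_score, _, item = heapq.heappop(self._heap)
        return -neg_score, item

    def __len__(self) -> int:
        return len(self._heap)


agenda = Agenda()
agenda.push(-2.3, "NP -> NP PP")      # hypothetical rule scores
agenda.push(-0.4, "NP -> CORE-NP")
agenda.push(-1.1, "S -> NP VP")

expansion_order = []
while agenda:
    score, rule = agenda.pop()
    expansion_order.append(rule)
```

In a best-first parser, each popped item would be added to the chart and any new items it licenses pushed back with their own scores, so that the most probable analyses are explored first.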

Finally, we have introduced fallback processing into our use of grammatical information.
Given the presence of false starts, idiolectal variation, and other extra-grammatical
elements in spoken utterances, it is virtually impossible to cover all spoken input with a
restrictive grammar. The details of the mechanisms we have implemented are presented in
Section 4.3.


1.1.4  Discourse 


The application domain of the initial version of the DELPHI System was Resource Management.
Since most of the queries in this domain were simple database retrieval queries
or commands to the display system, there was little need for discourse processing. However,
the full range of applications of spoken language systems clearly requires discourse
capabilities. During the current project we have developed a discourse component that
utilizes domain-independent mechanisms for pronoun interpretation based on earlier work
with the Janus system. We have also developed a plan tracker that utilizes task- and
domain-specific information to guide discourse interpretation. Our work in this area is described
in Chapter 5.


1.2 BYBLOS (Speech Recognition Component)


The basic speech recognition technology used in this contract was developed under a
separate  contract,  and  under  that  contract  we  continue  to  improve  the  accuracy  of  our 
speech  recognition  system,  BYBLOS. 

Under this contract, our speech recognition work focused on developing new, more
efficient methods for integrating speech recognition with natural language, real-time recognition
on commercially available hardware, developing a real-time demonstration, and
building several specific components for a useful speech recognition system. Specifically:


• We reduced the computation needed for speech recognition by taking advantage of
the back-off structure of statistical n-gram grammars. We also greatly increased the
efficiency of the speech recognition search to the point where it can run in real
time on an SGI 4D/35 (which is about 3-4 times faster than a SUN 4/280) with
vocabularies of over 1,000 words. Thus no special-purpose hardware is required.

• We developed the N-Best Paradigm for integrating speech recognition and natural
language. We also developed the first exact algorithm for finding the N-Best sentence
hypotheses, and then developed several more efficient approximate algorithms.

• We applied our Forward-Backward Search technique (developed in the previous contract
for integrating Multiple Knowledge Sources) to speed up the N-Best search by
an additional factor of 40.

•  We  developed  the  first  algorithm  for  detecting  when  a  speaker  says  a  word  outside 
the  vocabulary.  We  also  implemented  a  method  that  allows  a  naive  user  to  add  new 
words  to  the  dictionary. 


1.2.1  Recognition  Speed 

In the previous contract for integrating speech recognition with natural language, the speech
recognition component produced a dense lattice of word hypotheses, which were then
searched (parsed) by the natural language component. At the beginning of this contract
a 1-best search required about 100 times real time to run on a SUN 4/280. Creating the
word lattice required about 1,000 times real time. We clearly needed another approach to
increase the speed of our speech recognition.

In  a  previous  contract  for  using  parallel  processors  for  speech  recognition  we  showed 
that  it  was  possible  to  use  a  large  parallel  processor  to  speed  up  the  recognition  search.  In 


BBN  Systems  and  Technologies  BBN  Report  No.  7715  19 

particular, we used a Butterfly™ parallel processor with 98 nodes to speed up a simple
search  by  a  factor  of  77.  In  addition,  we  used  32  nodes  to  speed  up  a  more  complex 
grammar  search  by  a  factor  of  16.  These  improvements  in  recognition  speed,  while  very 
impressive,  were  not  sufficient  for  our  purposes  and  did  not  justify  the  significant  cost  of 
the  parallel  processors  or  the  additional  time  taken  to  program  a  parallel  machine. 
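
The speedup figures quoted above imply a parallel efficiency (speedup divided by node count) that can be computed directly. The short sketch below uses only the numbers given in the text; the function name is an assumption.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of ideal linear speedup achieved on the given number of nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


simple_search = parallel_efficiency(77, 98)    # simple search on 98 nodes
grammar_search = parallel_efficiency(16, 32)   # more complex grammar search on 32 nodes
```

The simple search reaches roughly 79% of linear speedup, while the more complex grammar search reaches 50%. Even at ideal efficiency, 98 nodes could offer at most about two orders of magnitude.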

A separate solution for increased speed was being investigated at SRI and UC Berkeley,
namely,  the  design  of  special-purpose  VLSI  chips  for  real-time  speech  recognition.  That 
effort  was  started  in  1988  but  had  not  achieved  its  objective  by  the  end  of  this  contract  in 
February  1992. 

At BBN, we took a very different point of view. Our primary goal in terms of recognition
speed was to eliminate the need for special-purpose or parallel hardware for the
recognition search. This required that we greatly reduce the time needed for the basic
recognition search algorithmically. The type of language model we use most often is a
fully-connected statistical n-gram grammar, which represents the probability of each word
given the previous one or two words. We reduced the computation required for this kind
of grammar significantly by taking advantage of the back-off structure of this general type
of grammar. In particular, in a bigram grammar, instead of having a separate grammar
transition for each pair of words, we realized that we only needed to include those transitions
corresponding to word pairs that were actually observed in the training. All other
transitions are accommodated by a transition with a back-off probability from the end of
each word to a “unigram node” from which there is a unigram transition to the beginning
of each word. This simplification greatly reduced the computation needed with no change
in accuracy.
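
The effect of the back-off structure can be sketched with a probability lookup and a transition count. This is a minimal sketch under stated assumptions: the dictionary-based model, the function names, the toy probabilities, and the figure of 20,000 observed word pairs are all hypothetical; only the network structure (explicit arcs for observed pairs, one back-off arc per word into a unigram node, one unigram arc per word out of it) follows the text.

```python
from typing import Dict, Tuple


def backoff_bigram_prob(
    prev: str,
    word: str,
    bigram: Dict[Tuple[str, str], float],
    backoff_weight: Dict[str, float],
    unigram: Dict[str, float],
) -> float:
    """P(word | prev): explicit bigram if observed, else back off through the unigram node."""
    if (prev, word) in bigram:
        return bigram[(prev, word)]
    return backoff_weight[prev] * unigram[word]


def transition_counts(vocab_size: int, observed_pairs: int) -> Tuple[int, int]:
    """Number of grammar transitions in the full and the back-off bigram networks."""
    full = vocab_size * vocab_size
    # Observed pairs + one back-off arc per word + one unigram arc per word.
    compact = observed_pairs + 2 * vocab_size
    return full, compact


bigram = {("to", "boston"): 0.2}
backoff_weight = {"to": 0.5}
unigram = {"boston": 0.05, "denver": 0.1}

seen = backoff_bigram_prob("to", "boston", bigram, backoff_weight, unigram)
unseen = backoff_bigram_prob("to", "denver", bigram, backoff_weight, unigram)
full_arcs, compact_arcs = transition_counts(1000, 20000)
```

With a 1,000-word vocabulary and the hypothetical 20,000 observed pairs, the network shrinks from 1,000,000 transitions to 22,000, while every word pair still receives the probability the back-off model assigns to it.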

In addition to the above approach, we also reimplemented the recognition program,
using good software engineering practices, in order to make it more efficient. The net
result is that a 1-Best recognition with over 1,000 words can now be performed in real
time on a workstation. This means that as soon as a speaker has stopped speaking, the
program will print out the most likely sentence hypothesis.

Finally, we developed a real-time demonstration system using commercially available
hardware. In particular, we used a workstation for the recognition search, with a readily
available signal processing board to perform the front end signal analysis and vector quantization.
This demonstration, which was given in August 1991, was, to our knowledge, the
first real-time demonstration of continuous speech recognition with 1,000 words (without
excessive loss of accuracy). That it was performed on commercially available
hardware made it all the more remarkable.

Subsequently, a separate effort at BBN to commercialize speech recognition technology
developed a much faster version of the front end signal processing that was able to run
completely on the workstation in a fraction of real time. Therefore, by using a workstation
with a built-in A/D capability (like the SGI 4D/35 or the SUN SparcStation 2), we now
have real-time recognition operating completely on a standard workstation. This system
was demonstrated at the February 1992 DARPA Speech and Natural Language Workshop.


1.2.2 The N-Best Paradigm and Algorithms for Finding N-Best Hypotheses

We  observed  that  a  tight  coupling  of  several  knowledge  sources  often  resulted  in  very  large 
computation.  We  postulated  that  a  more  efficient  integration  of  several  knowledge  sources 
might  be  achieved  by  judiciously  ordering  the  knowledge  sources  so  that  those  that  were 
powerful and yet inexpensive to apply were used first.

We developed the N-Best Paradigm in which the BYBLOS system now produces not
one, but several (the N Best) alternative sentence hypotheses using a discrete HMM with a
bigram grammar. These hypotheses are first rescored using more detailed speech knowledge
sources, including between-word coarticulation models, semi-continuous density HMMs
and trigram statistical language models. Then, the reordered list is given to the DELPHI
natural language system, which searches for the highest scoring hypothesis that is
meaningful and results in a valid answer.
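
The flow of the N-Best Paradigm can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names, the toy scores, and the two stand-in knowledge sources are hypothetical and are not taken from BYBLOS or DELPHI.

```python
from typing import Callable, List, Optional, Tuple

Hypothesis = Tuple[str, float]  # (word string, log score); higher is better


def rescore(nbest: List[Hypothesis],
            extra_score: Callable[[str], float]) -> List[Hypothesis]:
    """Add the log score of a more detailed knowledge source and reorder."""
    rescored = [(words, score + extra_score(words)) for words, score in nbest]
    return sorted(rescored, key=lambda hyp: hyp[1], reverse=True)


def first_meaningful(nbest: List[Hypothesis],
                     is_meaningful: Callable[[str], bool]) -> Optional[str]:
    """Return the highest scoring hypothesis the language component accepts."""
    for words, _ in nbest:
        if is_meaningful(words):
            return words
    return None


def toy_trigram_score(words: str) -> float:
    # Hypothetical stand-in for a trigram language model.
    return -3.0 if "two boston" in words else 0.0


def toy_understands(words: str) -> bool:
    # Hypothetical stand-in for the natural language component.
    return "two" not in words.split()


# Hypothetical first-pass N-best list (cheap models, bigram grammar).
first_pass = [("show me flights two boston", -9.5),
              ("show me flights to boston", -10.0),
              ("show me flight to boston", -10.5)]
reordered = rescore(first_pass, toy_trigram_score)
answer = first_meaningful(reordered, toy_understands)
```

The point of the design is that the expensive knowledge sources are applied to only N strings, rather than inside the recognition search itself.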


We considered and developed several algorithms for finding the N-Best hypotheses.
These included the Exact Sentence-Dependent N-Best algorithm, the approximate
Word-Dependent N-Best algorithm, and the very fast Lattice N-Best algorithm. The Exact
algorithm can be shown to guarantee that we can find the correct “forward probability” for all of
the hypotheses that are within a threshold of the most likely hypothesis. The computation
required is only linear in the number of hypotheses found.

The approximate N-Best algorithms were developed in an attempt to reduce the
computation needed without sacrificing accuracy. We chose to use the Word-Dependent N-Best
algorithm, because we found that it was empirically as accurate as the exact algorithm, and
yet did not require as much computation.


1.2.3 Forward-Backward Search Algorithm


Even though the new Word-Dependent N-Best algorithm was quite fast, it was not fast
enough to be performed in real time. We needed a way to speed it up by more than an
order of magnitude. In the previous contract on combining Multiple Knowledge Sources
we had developed the Forward-Backward Search Algorithm. In this method, we use a
simplified forward search to speed up the computation of a more complex backward search.
We applied the Forward-Backward Search Algorithm to speed up the computation of the
N-Best search by a factor of 40 with no loss in accuracy. This meant that the N-Best
computation, which only begins after the speaker has stopped, is completed in a small
fraction of real time, or typically a delay of one second. We felt that this delay was not
excessive for most applications.
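
The idea of letting a cheap forward pass prune an expensive backward pass can be sketched on a toy state trellis. This is not the BYBLOS implementation, which operates on words and HMMs; the state-level trellis, the function names, the pruning test, and all scores below are assumptions chosen for illustration.

```python
import math


def forward_pass(cheap, trans):
    """Viterbi pass with the cheap model: alpha[t][s] is the best log score
    of any path that starts at frame 0 and is in state s at frame t."""
    T, S = len(cheap), len(cheap[0])
    alpha = [list(cheap[0])]
    for t in range(1, T):
        row = []
        for s in range(S):
            best_prev = max(alpha[t - 1][p] + trans[p][s] for p in range(S))
            row.append(best_prev + cheap[t][s])
        alpha.append(row)
    return alpha


def backward_pass(detailed, trans, alpha, beam):
    """Backward Viterbi pass with the detailed model. A state is scored only
    if its forward score plus the backward score of its best continuation is
    within `beam` of the best forward score for the whole utterance."""
    T, S = len(detailed), len(detailed[0])
    threshold = max(alpha[T - 1]) - beam
    beta = None          # backward scores of the frame processed previously
    evaluated = 0        # number of detailed-model scores actually used
    for t in range(T - 1, -1, -1):
        current = [-math.inf] * S
        for s in range(S):
            if t == T - 1:
                future = 0.0
            else:
                future = max(trans[s][n] + beta[n] for n in range(S))
            if alpha[t][s] + future < threshold:
                continue  # pruned: the detailed score is never computed
            current[s] = detailed[t][s] + future
            evaluated += 1
        beta = current
    return max(beta), evaluated


# Toy log scores (hypothetical): 3 frames, 2 states.
TRANS = [[0.0, -1.0], [-1.0, 0.0]]
CHEAP = [[-1.0, -5.0], [-1.0, -5.0], [-1.0, -5.0]]
DETAILED = [[-1.5, -4.0], [-0.5, -6.0], [-1.0, -5.0]]

alpha = forward_pass(CHEAP, TRANS)
pruned = backward_pass(DETAILED, TRANS, alpha, beam=2.0)
unpruned = backward_pass(DETAILED, TRANS, alpha, beam=math.inf)
```

In this toy case the pruned and unpruned passes find the same best score, but the pruned pass uses half as many detailed-model scores.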


1.2.4  Detecting  and  Adding  New  Words 


One problem in a large vocabulary speech recognition system is that the user cannot
remember which words are in the vocabulary. Often, a user will say a word outside of the
vocabulary, resulting in a recognition error. However, the user will simply try to say the
sentence again, not realizing that the word is not in the vocabulary. We felt that it would
be quite desirable for the system to warn the user when a new word has been spoken, even
though the system has no idea ahead of time of what new words may be spoken.

We developed a technique for detecting out-of-vocabulary words. The technique is
based on having a general alternate model for new words. The model, which is constructed
from a network of phoneme models, is designed to match any word reasonably well.
However, it does not match existing words as well as the specific models for those words.
The technique was tested using the DARPA Resource Management corpus. When operating
in speaker-dependent mode, we were able to detect over 70% of the new words spoken,
while falsely detecting new words in only 1% of the sentences. In speaker-independent
mode the detection rate dropped to 50% with a false acceptance rate of 2%.
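
The competition between the generic new-word model and the word-specific models can be pictured with the following minimal Python sketch. The function name and the log scores are hypothetical; the real system makes this comparison inside the recognition search rather than on isolated word scores.

```python
from typing import Dict, Optional


def recognize_word(vocab_scores: Dict[str, float],
                   new_word_score: float) -> Optional[str]:
    """Return the best in-vocabulary word, or None when the generic
    new-word model scores higher than every vocabulary word."""
    best_word = max(vocab_scores, key=vocab_scores.get)
    if new_word_score > vocab_scores[best_word]:
        return None  # report a new (out-of-vocabulary) word
    return best_word


# A known word: its specific model beats the generic model.
known = recognize_word({"boston": -20.0, "austin": -35.0}, new_word_score=-28.0)
# An unknown word: no specific model fits as well as the generic one.
unknown = recognize_word({"boston": -60.0, "austin": -55.0}, new_word_score=-40.0)
```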

Once the system has detected a new word, the user needs a way to add it, even though
they do not know how to create phonetic spellings. The system asks the user to type the
word and say it once. It then uses the combination of the typed spelling and the spoken
utterance to determine the most likely phonetic spelling of the new word so that it can be
incorporated in the system.


1.3  Speech  and  Natural  Language  Integration 


In our initial integration of Speech Recognition and Natural Language Processing
technologies, we used a word lattice as the data structure to interface the two components.
The speech component would produce a lattice of all the word hypotheses found and natural
language would find all syntactically permissible and semantically acceptable utterance
hypotheses covering the speech input. This was a computationally expensive procedure.
During this project we have moved to using the N-best hypotheses lists produced by speech
as the data structure for interfacing the two components. DELPHI processes these hypotheses
until either a hypothesis which produces an answer (a database retrieval) is encountered
or the number of hypotheses processed crosses a threshold, which can be varied. We
have experimented with a number of variations on this strategy, and have found that a
threshold of 5 produces the best results. We have also experimented with optimal ways of
incorporating our fallback strategies. We have found that a two pass scheme, in which the
N-best hypotheses are first processed with fallback processing disabled, and in which fallback
is utilized as a second pass only if no answer is obtained, is superior to architectures
which either totally omit fallback processing or use the system with fallback processing as
the first pass. More details are given in Section 6.4.
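
The two pass scheme can be summarized in a short Python sketch. The function names and the toy interpreters are hypothetical, and applying the same threshold of 5 in the second pass is an assumption, since the text does not say how many hypotheses the fallback pass examines.

```python
from typing import Callable, List, Optional, Tuple

Interpreter = Callable[[str], Optional[str]]  # returns an answer or None


def understand(nbest: List[str],
               strict: Interpreter,
               fallback: Interpreter,
               threshold: int = 5) -> Optional[Tuple[str, str]]:
    """Try the top hypotheses with fallback disabled; use fallback
    processing as a second pass only if no answer was obtained."""
    candidates = nbest[:threshold]
    for interpret in (strict, fallback):
        for hypothesis in candidates:
            answer = interpret(hypothesis)
            if answer is not None:
                return hypothesis, answer
    return None


def toy_strict(hypothesis: str) -> Optional[str]:
    # Hypothetical stand-in for full parsing and database retrieval.
    return "full answer" if hypothesis == "h6" else None


def toy_fallback(hypothesis: str) -> Optional[str]:
    # Hypothetical stand-in for fallback processing.
    return "partial answer" if hypothesis == "h2" else None


result = understand(["h1", "h2", "h3", "h4", "h5", "h6"], toy_strict, toy_fallback)
```

Here the only strictly parsable hypothesis lies beyond the threshold, so the second pass answers from the second hypothesis instead.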


1.4  Implementation 


DELPHI is implemented in Common LISP. The earliest versions of DELPHI were developed
on Symbolics 3600-class LISP Machines and TI Explorers. There are currently
versions on Sun RISC-based workstations and Silicon Graphics machines running Lucid
Common LISP under UNIX. The main computational core of the system is machine-independent;
the identical code runs on all the systems named. A device-independent
loader file allows the system code to be run on all these machines; it is conditionalized
to handle input/output differences among them. During this project, we also developed a
“configuration” mechanism that parametrizes the load procedure for the DELPHI system
to load in collections of data files (lexicon and semantics) for different domains. This has
allowed us to switch quickly from one version of the ATIS database to another, as well as
to configurations for other domains.

BYBLOS  is  implemented  in  C  and  runs  on  UNIX  workstations.  HARC  runs  on  UNIX 
machines,  using  standard  UNIX  facilities  to  interface  DELPHI  and  BYBLOS.  The  user 
display  for  HARC  is  written  using  X  Windows. 




Chapter  2 


Syntax  and  Grammar 


In this section, we briefly describe the basic grammar formalism used in DELPHI and the
changes that have been made to the formalism and to the grammar during the course of
this project. One of the major changes made to the grammar is the widespread use of
a device which is frequently used in Definite Clause Grammar work, but which
has less often been utilized in the work on unification grammar that has appeared more
recently ([87] [66]). This is the right-hand side “constraint relation” that does not derive
any constituent of the utterance, but rather serves to constrain in various ways the feature
assignments of those rule elements which do. Constraint relation elements are thus to be
distinguished from other non-terminals which derive empty constituents, such as gaps or
zero morphemes. We show their utility for syntactic purposes in this chapter and discuss
their uses for semantic interpretation in Section 3.6.

Another  major  change  is  the  replacement  of  many  idiosyncratic  rules  for  the  expansion 
of  different  categories  with  a  recursive  “layer  of  the  onion”  type  structure  which  has  allowed 
us  to  reduce  the  number  of  rules  for  the  lexical  categories  NP,  VP,  and  AP,  while  retaining 
or  even  increasing  coverage. 

We have also split what were formerly single categories into multiple distinct categories
in order to make better use of prediction in parsing. Since the reasons for these splits are
closely tied to their benefits to parsing, we merely mention these changes here and describe
them in detail in Section 4.1.2, which discusses prediction.

Finally, we have introduced relational labels into the right hand side of grammar rules.
Labels include the traditional “Head” (for example, in the Noun Phrase “flights from
Boston to Denver”, “flights” is the head), grammatical relations such as “Subject”, “Direct
Object”, etc., and ad hoc labels for relations for which there exist no traditional names. The
main utility of such labels is that they provide a more uniform and simpler interface to
semantics. We introduce them briefly here in Section 2.4 and discuss their role in semantic
interpretation in Section 3.5. The use of relational labels is more than just a convenience.
In our current view of the relation between syntax and semantics, semantic interpretation
is seen as a process operating on a sequence of messages characterizing local grammatical
relations among phrases, rather than as a recursive tree walk over a globally complete and
coherent parse tree. The combination of incremental semantic interpretation and statistical
control of the parsing process (described in Section 4.2) makes it feasible to reconstruct
local grammatical relations with substantial accuracy, even when a global parse cannot
be obtained. Grammatical relations provide the interface between syntactic processing
and semantic interpretation, and standard global parsing is viewed as merely one way to
obtain evidence for the existence of such grammatical relations in an input string. This
focus on grammatical relations has led to substantial simplification of both grammar and
semantic rules, and will facilitate our ultimate aim of acquiring syntactic, semantic and
lexical knowledge by largely automatic means.

Note that all the rules presented in Sections 2.1-2.3 are in their current forms. However,
since  we  only  introduce  the  use  of  relational  labels  in  Section  2.4,  this  being  our  most  radical 
departure  from  standard  complex  feature  based  grammar  formalisms,  we  have  suppressed 
the  relational  labels  in  the  rules  in  these  earlier  sections  for  the  sake  of  exposition. 

We  begin  our  discussion  of  grammatical  changes  by  first  reviewing  the  basics  of  our 
grammar  formalism;  full  details  appear  in  [24]. 


2.1  The  Grammar  Formalism 


DELPHI uses a grammar formalism based on annotated phrase structure rules. While it is in
the general tradition of augmented phrase structure grammars, its immediate inspiration is
Definite Clause Grammars (DCGs) [74]. In such grammars, rules are made up of elements
that are not atomic categories but, rather, are complex symbols consisting of a category
label and feature specifications. Features (also called arguments) may be either constants,
indicated by lists in DELPHI, or variables, indicated by symbols with a leading colon
(:). Identity of variables in the different elements of a rule is used to enforce agreement
in the feature indicated by the variable. A (somewhat simplified) example of an actual
grammar rule, which introduces conjoined Noun Phrases, is shown in Figure 2.1.

    (NP (AGR :P (PLURAL)) (REALNP :REALZ) :POSS :WH ... :DET-CLASS ...)
      →
      (NP (AGR :PERSONX :NUMX) (REALNP :REALX) :POSS :WH ... :DET-CLASSX ...)
      (CONJ ...)
      (NP (AGR :PERSONY :NUMY) (REALNP :REALY) :POSS :WH ... :DET-CLASSY ...)
      {P-MIN :PERSONX :PERSONY :P}
      {NPTYPE-FILTER :REALX :REALY :REALZ}

Figure 2.1: A Grammar Rule for Conjoined Noun Phrases

This rule states that an NP can consist of an NP, followed by a conjunction (CONJ),
followed by another NP. Here, the identity of the :POSS feature, which indicates whether
a Noun Phrase is possessive (“Delta’s”) or not (“Delta”), on the lefthand NP and in the two
NPs on the righthand side, requires that all conjuncts be either possessive or non-possessive,
but does not allow a mixed conjunction. Similarly, the identity of :WH, which indicates
whether a Noun Phrase contains a question element (“who”, “which flights”) or not (“them”,
“the flights”) in the lefthand NP and the two conjuncts requires a conjoined NP to consist
entirely of question or non-question Noun Phrases. On the other hand, the appearance of
the distinct variables :DET-CLASS, :DET-CLASSX and :DET-CLASSY on the left and
right hand sides of the rule allows Noun Phrases with different classes of determiners to
be conjoined.
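
How shared variables enforce agreement while distinct variables leave a feature free can be shown with a small Python sketch. It is a simplification under stated assumptions: features are looked up by name in dictionaries here, whereas DELPHI uses positional arguments, and the function names and feature values are hypothetical.

```python
from typing import Dict, Optional

Features = Dict[str, object]


def match(pattern: Dict[str, str], features: Features,
          bindings: Dict[str, object]) -> Optional[Dict[str, object]]:
    """Bind each rule variable (a string starting with ':') to the value
    found on the constituent; fail if a variable is already bound to a
    different value."""
    extended = dict(bindings)
    for name, variable in pattern.items():
        actual = features[name]
        if variable in extended and extended[variable] != actual:
            return None
        extended[variable] = actual
    return extended


def conjoin(np1: Features, np2: Features) -> Optional[Features]:
    """Simplified conjunction rule: :POSS and :WH are shared by both
    conjuncts, while the determiner class uses two distinct variables."""
    left = {"poss": ":POSS", "wh": ":WH", "det_class": ":DET-CLASSX"}
    right = {"poss": ":POSS", "wh": ":WH", "det_class": ":DET-CLASSY"}
    bindings = match(left, np1, {})
    if bindings is None:
        return None
    bindings = match(right, np2, bindings)
    if bindings is None:
        return None
    return {"poss": bindings[":POSS"], "wh": bindings[":WH"]}


the_flights = {"poss": False, "wh": False, "det_class": "definite"}
some_fares = {"poss": False, "wh": False, "det_class": "indefinite"}
deltas = {"poss": True, "wh": False, "det_class": "none"}
```

With these toy constituents, `conjoin(the_flights, some_fares)` succeeds even though the determiner classes differ, while `conjoin(the_flights, deltas)` fails because :POSS cannot be bound to two values.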

P-MIN and NPTYPE-FILTER are constraint relations whose function will be discussed
in the next section. We introduce here the convention of indicating constraint relations
with curly braces ({}).

This formalism, like DCGs in general, maintains the term unification practice of using
functors with obligatory, positional arguments, rather than functorless feature structures
with optional, labelled arguments, as in much recent work, such as GPSG [37], LFG [28],
and PATR-II [88]. There are several reasons for this. First of all, we find the functor very
useful as a way of indicating just what features are allowed in a given structure (and it is
interesting to note attempts to restore it for just this purpose in [64]). Second, it is relatively
straightforward to have a simple syntactic checker that ensures that all grammar rules are
well-formed. In a grammar as large as DELPHI’s, having the ability to automatically
make sure that all rules are well-formed is no small advantage. Thirdly, argument labels
contribute their own clutter to the rule. They would thus seem to be a notational win only
if more than half the features of a given element are “don’t-cares”, i.e. are not required to
agree with features elsewhere in the rule. In our grammar, however, we find on average
that only about 3% of the feature slots in rules are “don’t-cares”.

Several of the formalisms mentioned also contain other advanced mechanisms, such as
feature disjunction, feature negation, metarules, and optional arguments. The occasions in
which we have found the use of negation of various sorts useful are handled by the compiled
constraint relations discussed in Section 2.2.1. Metarules would have been useful in the
earlier version of the system, which used traditional subcategorization frames and, therefore,
needed to “compile out” all the possible variants of each subcategorization frame into
separate rules. However, the mapping unit strategy for subcategorization, which we discuss
in Section 3.4, eliminates the need for this. The mapping unit approach to subcategorization
combined with the recursive structure of constituents discussed in Section 2.3 handles the
phenomena normally treated by an optional arguments mechanism. We have found it useful
to add a limited form of feature disjunction, which is described in Section 4.1.1.


2.2  Syntactic  Constraint  Analyses 


While constraint relations have long been used in DCG-based work to treat, among other
things, problems of quantifier scoping and the interaction of quantifier scope and anaphora
[73] [75], we have found that there are other, more low-level issues of syntax and lexical
semantics for which constraint relations are either conceptually useful or formally necessary,
even in unification formalisms which allow full disjunction across features, such as [52]
[55] [54] [49]. In particular, we discuss the constraints imposed by subgrammars on
conjunction and relaxation of subject verb agreement in this section, and describe the use
of constraint relations for semantic interpretation in Section 3.6.

Restrictions on agreement between features sometimes depend on the values of still
other features. For example, English noun phrase conjunctions require that if one of the
conjuncts belongs to a particular subgrammar (date, time etc.) then the other conjunct(s)
must also belong to that subgrammar, but if neither of the conjuncts belongs to a
subgrammar, there is no restriction. This can be done by including in the NP conjunction
rule a constraint relation which provides a case-analyzing effect that could not be provided
merely by unifying the variables ranging over the semantic type. This function is carried
out by NPTYPE-FILTER, which we introduced above. This relation takes as its arguments
the NPTYPE feature of the first conjunct, the NPTYPE of the second conjunct, and the
NPTYPE feature of the conjunction as a whole. This relation, then, can be thought of as
constraining the possible triples of these values.¹ We present four instances of this rule for
illustration in Figure 2.2, where “→ ∅” indicates production of the empty string.

    (NPTYPE-FILTER (NONUNITNP (-PRO (TIMENP)))
                   (NONUNITNP (-PRO (TIMENP)))
                   (NONUNITNP (-PRO (TIMENP))))
      → ∅

    (NPTYPE-FILTER (NONUNITNP (-PRO (DATENP)))
                   (NONUNITNP (-PRO (DATENP)))
                   (NONUNITNP (-PRO (DATENP))))
      → ∅

    (NPTYPE-FILTER (NONUNITNP (-PRO (MISCNP)))
                   (NONUNITNP (+PRO :PRO-TYPE))
                   (NONUNITNP (-PRO (MISCNP))))
      → ∅

    (NPTYPE-FILTER (NONUNITNP (+PRO :PRO-TYPE))
                   (NONUNITNP (-PRO (MISCNP)))
                   (NONUNITNP (-PRO (MISCNP))))
      → ∅

Figure 2.2: A Sample Constraint Relation 1: NPTYPE-FILTER

The first two rules require that if either of the conjuncts of a conjoined NP belongs
to the TIMENP or DATENP subgrammar, then the other conjunct must belong to that
subgrammar, as well. The last two rules allow ordinary pronominal and non-pronominal
NPs to conjoin freely.² In this example, note that while the first two rules could be collapsed
into a single rule utilizing unifying variables, the third and fourth cannot.

¹Alternatively, it can be viewed as computing a value for the NPTYPE feature of the conjunction from
the values of the individual conjuncts.

²They also set the NPTYPE of the conjunction to being non-pronominal, since it cannot function as a
pronoun.


Another use of constraint relations is to “compute” a value from the feature values of
the relevant constituents, rather than requiring identity of features. For example, in NP
conjunction with “and” in English, the person of the conjoined NP is first person if any of
the conjuncts is first person, second person if any of the conjuncts is second person and
none is first person, and third person if all the conjuncts are third person.³ This can be
easily handled by the P-MIN constraint relation, which takes as its arguments the PERSON
feature of the first conjunct, the PERSON of the second conjunct, and the PERSON feature
of the conjunction as a whole. It has the solutions shown in Figure 2.3.


    (P-MIN (1ST) :P    (1ST)) → ∅
    (P-MIN (3RD) :P    :P)    → ∅
    (P-MIN (2ND) (1ST) (1ST)) → ∅
    (P-MIN (2ND) (3RD) (2ND)) → ∅
    (P-MIN (2ND) (2ND) (2ND)) → ∅

Figure 2.3: A Sample Constraint Relation 2: P-MIN
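
The P-MIN relation can be read either as a table of solutions or as a function that takes the minimum of two person values. The Python sketch below encodes both readings so that they can be compared; the table is our reading of the five solutions in Figure 2.3, and the function names are hypothetical.

```python
from itertools import product

PERSON_RANK = {"1ST": 1, "2ND": 2, "3RD": 3}

# (first conjunct, second conjunct, conjunction); ":P" is a rule variable.
P_MIN_SOLUTIONS = [
    ("1ST", ":P", "1ST"),
    ("3RD", ":P", ":P"),
    ("2ND", "1ST", "1ST"),
    ("2ND", "3RD", "2ND"),
    ("2ND", "2ND", "2ND"),
]


def p_min_relation(first: str, second: str):
    """Look up the person of the conjunction in the table of solutions."""
    for sol_first, sol_second, sol_result in P_MIN_SOLUTIONS:
        if sol_first != first:
            continue
        if sol_second == ":P":
            # The variable matches any value; the result may reuse it.
            return second if sol_result == ":P" else sol_result
        if sol_second == second:
            return sol_result
    return None


def p_min_function(first: str, second: str) -> str:
    """Same relation computed directly: the lower-numbered person wins."""
    return min(first, second, key=PERSON_RANK.get)


ALL_PAIRS = list(product(PERSON_RANK, repeat=2))
```

Exactly one solution applies to each pair of person values, so the table behaves as a total function on the nine possible inputs.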

Still another case in which constraint relations provide a kind of flexibility greater than
that available using standard unification is to allow the grammar to express “degrees of
grammaticality”. For example, in standard written English, it is common for a verb to
agree in number with its subject:⁴

    What do the restrictions represent?
    What does restriction VU/1 mean?
    What do the transport codes AL and R mean?

³Karttunen [52] handles such cases with a process of generalization, rather than unification.

⁴All of the examples in this discussion of subject-verb agreement are taken from the ATIS0 and ATIS1
corpora collected by TI and distributed by NIST.

However, in spoken English, conjoined noun phrases sometimes appear with singular
agreement on the verb.

    What does RETURN MIN and RETURN MAX mean?
    What does class B and class Y mean?

In still looser speech, agreement disappears even with non-conjoined subject noun
phrases:

    list all the airlines that flies from Dallas to Boston nonstop.

These facts can be handled by modifying the standard sentence rule:

    (ROOT-S ...)
      →
      (NP :AGR ... :CONJC ...)
      (VP :AGR ...)

to the following:

    (ROOT-S ...)
      →
      (NP :AGR ... :CONJC ...)
      (VP :AGRX ...)
      {SUBJECT-VERB-AGREEMENT :AGR :AGRX :CONJC}

and adding the solutions for SUBJECT-VERB-AGREEMENT shown in Figure 2.4. This
is a constraint relation that takes as its arguments the agreement feature of the sentence’s
subject NP, the agreement feature of the sentence’s VP, and the conjunction feature of the
subject NP, which indicates whether it is a conjunction or not.




    (SUBJECT-VERB-AGREEMENT (AGR :P :N)
                            (AGR :P :N)
                            :CONJ)
      → ∅

    (SUBJECT-VERB-AGREEMENT (AGR :P (PLURAL))
                            (AGR :P (SINGULAR))
                            (+CONJ :CONJTYPE))
      → ∅

    (SUBJECT-VERB-AGREEMENT (AGR :P (PLURAL))
                            (AGR :P (SINGULAR))
                            (-CONJ))
      → ∅

    (SUBJECT-VERB-AGREEMENT (AGR :P (SINGULAR))
                            (AGR :P (PLURAL))
                            (-CONJ))
      → ∅

Figure 2.4: A Sample Constraint Relation 3: SUBJECT-VERB-AGREEMENT

The first of these solutions enforces standard subject verb agreement: whether the
subject NP is a conjunction or not, the agreement features of the subject and the VP must
be the same. The second allows the VP to bear the singular feature when the subject is
plural, just in case the subject is a conjunction. The last two rules allow the subject and
the VP to disagree⁵ when the NP is not a conjunction. The statement about the conjunction
status of the subject is necessary in these last two solutions to make them orthogonal to
the first two, so that a single structure will not be unnecessarily analyzed with more than
one solution.

This mechanism is superior to simply not requiring a VP to agree with its subject at all,
by using distinct variables for the agreement features of the subject NP and the VP, since,
given a large enough corpus, we can automatically associate different probabilities with
the different solutions of this constraint relation. This is particularly useful in a spoken
language system, since this will allow all the possibilities, with some degree of probability,
but will always prefer the most common solution and will only choose a less likely solution
if a more common one is unavailable.
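
A small Python sketch shows how probabilities attached to the solutions lead the system to prefer the common case without ruling out the rarer ones. The probability values and function names are hypothetical; in the system described here the probabilities would be estimated from a corpus.

```python
from typing import List, Optional, Tuple

# (subject number, verb number, subject is a conjunction)
Analysis = Tuple[str, str, bool]


def solution_probability(subject: str, verb: str, is_conjunction: bool) -> float:
    """Hypothetical probabilities for the four agreement solutions."""
    if subject == verb:
        return 0.95            # standard agreement
    if is_conjunction:
        # Only a plural conjoined subject with a singular verb is allowed.
        return 0.03 if (subject, verb) == ("PLURAL", "SINGULAR") else 0.0
    return 0.01                # disagreement with a non-conjoined subject


def choose(analyses: List[Analysis]) -> Optional[Analysis]:
    """Pick the most probable analysis; reject those with no solution."""
    scored = [(solution_probability(*a), a) for a in analyses]
    scored = [(p, a) for p, a in scored if p > 0.0]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


agreeing = ("PLURAL", "PLURAL", False)
disagreeing = ("PLURAL", "SINGULAR", False)
```

When both analyses compete, the agreeing one wins; when only the disagreeing one is available, it is still accepted rather than rejected.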


2.2.1  Constraint  Relation  Compilation 


An analysis of the time spent in parsing showed that substantial time was spent in processing
constraint relations via standard unification, and it became clear that more efficient data
structures and procedural techniques could be used to implement many such computations.
Therefore, we have developed a mechanism, in addition to our standard rule formalism,
that permits the creation of constraint relations that will be compiled into executable LISP
code. From the standpoint of the rest of the grammar, these constraint relations look
exactly like any other constituent: they return either a substitution list if the constraint
succeeds or the designated symbol FAIL if it does not. However, the ability to define
elements in the grammar that in fact are executed as code means that, in effect, we have
a grammar consisting of declarative elements that can contain associated procedures. For
example, while the P-MIN and NPTYPE-FILTER constraint relations were presented in
their current declarative form, we have found it more efficient to redefine the
SUBJECT-VERB-AGREEMENT constraint relation in a procedural version.

The ability to define procedural elements as part of grammar rules is crucial for the
recasting of the grammar using relational labels, discussed below in Section 2.4 and in
Section 3.5. Without the ability to attach procedures to our rules that check for the
coherence of local syntactic and/or semantic representation, our current system would have been
impossible.


2.3  Subgrammar  Development 


In the earlier version of the grammar, optional elements were handled in the following
manner. Special categories were introduced to simulate optionality. By convention, such
categories had names that began with OPT. These nodes were used to implement optionality
in the following way. First, there was always a rule of the form:

    OPT<category> → ∅

that is, a rule expanding to nothing. The inclusion of such a rule was what allowed
categories to be optional in the first place. In addition, there were always one or more
rules of the form:

    OPT<category> → <cat_i> ...

where <cat_i> is a category of the grammar.

To allow multiple occurrences of the same optional category, such optional nodes were
produced recursively:

    OPT<category> → <cat_i> OPT<category>


While we have retained a few such dummy categories, mostly for non-iterative optional
categories, we have developed another scheme that is used quite generally throughout the
grammar for elements that may be optional and freely ordered. This scheme is to make the
categories that contain optional modifiers, complements, and adjuncts recursive, and to write
a small number of rules, each one introducing a single optional constituent. This device
immediately allows for optionality, iteration, and free order of elements. Moreover, it
eliminates an unpleasant side-effect of our earlier treatment of optionality. Optional positive
adjectival modifiers of NP were introduced via the category OPTADJP. Since various other
elements, such as present participles (“flying”) and passive participles (“booked”) also
functioned as pre-nominal modifiers, such elements were “lifted” to Adjectives or Adjective
Phrases by chain (unit-production) rules, to allow them to function as nominal modifiers
and also to give them the correct semantics. In our current version of the grammar, all
prenominal modifiers are introduced by rules of the form:

    (N-BAR ...) → <pre-nominal-modifier> (N-BAR ...)

<pre-nominal-modifier> currently includes:

•  positive adjective phrases, as in “a cheap flight”.
•  common nouns, as in “the Boston flight”.
•  unit noun-phrases, as in “two hundred dollar one way tickets”.
•  time expressions, as in “the 3 o’clock flight”.
•  date expressions, as in “the November ninth flights”.
•  passive participles, as in “a confirmed reservation”.
•  present participles, as in “a departing flight”.


We have adopted a similar scheme for the post-head modifiers and complements of N.
In the case of NP, we first created the category CORE-NP, which consists of all the left
modifiers of an NP up to and including the head (“the flights”, “the evening flights”, “the
cheapest evening flights”, etc.). Next, we added the chain rule:

    (NP ...) → (CORE-NP ...)

Finally, we added recursive NP rules, each adding one of the types of constituents that
can follow the head noun:

•  relative clause, as in “a flight that leaves after 3 PM”.
•  adjective phrase, as in “a flight non-stop from Boston to San Francisco”.
•  present participial phrase, as in “a flight arriving before 3 PM”.
•  passive participial phrase, as in “a flight deleted due to weather”.
•  a “path” expression, as in “a flight Boston to San Francisco”.
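
The way a chain rule plus one recursive rule per modifier type yields optionality, iteration, and free order can be sketched as a tiny recognizer in Python. The sketch assumes that the recursive rules have the shape NP → NP <post-modifier>, which the text implies but does not spell out; the category tags and the function name are hypothetical.

```python
from typing import List

POST_MODIFIERS = {
    "relative-clause",
    "adjective-phrase",
    "present-participial-phrase",
    "passive-participial-phrase",
    "path-expression",
}


def is_np(constituents: List[str]) -> bool:
    """Recognize a sequence of constituent tags as an NP."""
    if not constituents:
        return False
    if len(constituents) == 1:
        # Chain rule: NP -> CORE-NP
        return constituents[0] == "core-np"
    # Recursive rules: NP -> NP <post-modifier>, one rule per modifier type
    return constituents[-1] in POST_MODIFIERS and is_np(constituents[:-1])
```

For example, `is_np(["core-np"])` holds with no modifiers at all, and `is_np(["core-np", "path-expression", "relative-clause"])` holds for modifiers in either order, without any rule that lists a particular combination.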


One minor change to the NP grammar that we have introduced is the category
UNIT-CORE-NP. This is a Noun Phrase whose head is a unit of measure (e.g. “foot”, “dollar”,
“pound”, “year”, etc.) and which appears pre-nominally. In this position, the head noun
remains singular, no matter the number of its specifier: compare “Show me the flight that
costs five hundred dollars” with “Show me the five hundred dollar flight”. This category is
highly restricted in its distribution, being limited to the recursive N-BAR rule introduced
above and an Adjective Phrase rule that specifies that the Adjective Phrase appears
pre-nominally.

We have also adopted a similar scheme for the post-modifiers and complements of
ADJP. CORE-ADJP is a much more impoverished category than CORE-NP, however,
since the left modifiers of Adjectives are pretty much limited to a type of constituent we
have labelled DEGREESPEC (words like “so”, “very”, “too”, etc.), unit NPs (“five
years old”), and UNIT-CORE-NPs (“five year old”). The right modifiers and complements
are also more restricted than in NP.

Finally, our most spectacular use of this recursive, “layers of the onion” structure is in
the case of subcategorization of the arguments to V in VP, described in full detail in Section
3.4. All together, these grammar changes, combined with a shift of most rules introducing
terminal elements, such as the prepositions, the determiners, etc., from the grammar to the
lexicon, have resulted in a reduction in the size of the grammar with no loss in coverage
and, in fact, with an actual increase in coverage. The overall size of the grammar has gone
from over 1100 rules to approximately 450. In the case of VP rules, we have gone from
over 80 rules to 15.


2.4  The  Use  of  Labelled  Arguments 


While the earlier version of DELPHI utilized semantic interpretation as a post-process on
a parse produced by syntax alone, during the course of this project we have introduced
semantic elements into the grammar and made semantic processing part of the parsing
process. The full details of these changes are presented in Chapter 3. Here, we discuss just
one of these changes, since it affects the form of grammar rules. In the current version of
the DELPHI grammar, rules no longer consist merely of a left hand side, which contains
a single grammatical constituent, deriving a right hand side, which is a set of one or more
grammatical constituents. Rather, each element on the right hand side is preceded by a
label, indicating its grammatical relation in the structure produced. These labels
include such familiar grammatical relations as “head”, as in:


(N-BAR ...)
→
:HEAD
(N ...)

or  “subject”,  as  in: 

(ROOT-S ...)
→
:SUBJECT
(NP ...)
:HEAD
(VP ...)

In structures which do not have a head in the traditional syntactic sense, such as
conjunctions, a dummy head may be inserted, whose semantics is computed by the syntax
on the basis of general rules. Thus, the NP conjunction rule introduced above in Figure
2.1 has the form shown in Figure 2.5 in the current grammar, where the dummy head *NP*
consumes no input, but is a compiled constraint relation that produces the semantics for
the entire conjoined NP.

(NP (AGR :P (PLURAL)) (REALNP :REALZ) :POSS :NH ... :DET-CLASS ...)
→
:ARG1
(NP (AGR :PERSONX :NUMX) (REALNP :REALX) :POSS :NH ... :DET-CLASSX ...)
:CONJ-ELEMENT
(CONJ ...)
:ARG2
(NP (AGR :PERSONY :NUMY) (REALNP :REALY) :POSS :NH ... :DET-CLASSY ...)
:HEAD
{*NP*}
{P-MIN :PERSONX :PERSONY :P}
{NPTYPE-FILTER :REALX :REALY :REALZ}

Figure  2.5:  A  Grammar  Rule  for  Conjoined  Noun  Phrases 
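
As a rough illustration of what a labelled right hand side looks like as data, the following sketch encodes the three rules discussed in this section. The class `LabelledRule` and the helper `element_for` are hypothetical names introduced here, and features are omitted; only the labels and categories come from the rules above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledRule:
    lhs: str
    rhs: tuple  # ((label, element), ...) in surface order; features omitted


N_BAR = LabelledRule("N-BAR", ((":HEAD", "N"),))
ROOT_S = LabelledRule("ROOT-S", ((":SUBJECT", "NP"), (":HEAD", "VP")))
NP_CONJ = LabelledRule(
    "NP",
    ((":ARG1", "NP"), (":CONJ-ELEMENT", "CONJ"), (":ARG2", "NP"), (":HEAD", "*NP*")),
)


def element_for(rule, label):
    """Return the right-hand-side element carrying the given label, or None."""
    for lab, element in rule.rhs:
        if lab == label:
            return element
    return None


print(element_for(ROOT_S, ":HEAD"))
print(element_for(NP_CONJ, ":HEAD"))
```

In this encoding the conjunction rule still has a :HEAD, but the element that carries it is the dummy *NP* rather than a syntactic constituent.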

In our current approach, we view the parser not as a device for constructing syntactic
trees, but as an information transducer that makes it possible to simplify and generalize
the rules for semantic interpretation. The purpose of syntactic analysis in this view is to
make information encoded in ordering and constituency as readily available as the
information encoded in the lexical items, and to map syntactic paraphrases to informationally
equivalent structures. The actual interface between parsing and semantics is a dynamic
process structured as a cascade (as in Woods’ notion of cascaded ATNs [98]), with
parsing and semantic interpretation acting as coroutines. The input to the semantic interpreter
is a sequence of messages, each requesting the “binding” of some constituent to a head.
The semantic interpreter does not perform any sort of recursive tree-walk over the syntactic
structure produced by the parser, and is in fact immune to many details of the tree structure;
see Section 3.5.
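
The message-passing interface can be sketched as a pair of cooperating routines. This is a minimal illustration under stated assumptions, not the DELPHI implementation: the message format `(label, head, constituent)`, the labels `:FROM` and `:TO`, and the function names are all invented here. The interpreter sees only a stream of binding requests, so two differently ordered realizations of the same complements yield the same result.

```python
def binding_messages(attachments):
    """Stand-in for the parser: lazily yields one binding request per attachment."""
    for label, head, constituent in attachments:
        yield (label, head, constituent)


def interpret(messages):
    """Consume binding requests one at a time; no parse tree is ever inspected."""
    bindings = {}
    for label, head, constituent in messages:
        bindings.setdefault(head, {})[label] = constituent
    return bindings


# "fly from Boston to Denver" versus "fly to Denver from Boston"
# (the labels are hypothetical).
a = interpret(binding_messages([(":FROM", "fly", "Boston"), (":TO", "fly", "Denver")]))
b = interpret(binding_messages([(":TO", "fly", "Denver"), (":FROM", "fly", "Boston")]))
print(a == b)
```

The generator plays the role of the parser side of the cascade: each request is produced only when the interpreter asks for the next one.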

The chief effort over the last year has been to codify this notion of binding or “logical
attachment”, simplifying the set of such attachments to highlight the common underlying
substructure of grammatical paraphrases. To this end we re-oriented our grammar around
the notion of “grammatical relation”. These relations may be seen as the end result of
making the information encoded in ordering and constituency explicitly available. (In
languages with freer word order this information is often encoded in morphological affixes
or pre- and post-positions.) From the point of view of a syntactic-semantic transducer,
the key point of any grammatical relation is that it licenses (one of) a small number of
semantic relations between the (“meanings” of) the related constituents. Sometimes the
grammatical relation constrains the semantic relation in ways that cannot be predicted
from the semantics of the constituents alone (given “John”, “Mary” and “kissed”, only the
grammatical relations or prior world knowledge determine who gave and who received).
Other times the grammatical relation simply licenses the one plausible semantic relation
(given “John”, “ate” and “hamburger”, if there is a relation, it is the hamburger that is
most likely to have been consumed; but in the sentence “John ate the fries but rejected
the hamburger” our knowledge of the destiny of the hamburger is mediated by its lack of
a grammatical relation to “ate”).

With this brief introduction, we now turn to a more detailed view of the semantics of
DELPHI. 


Chapter  3 


Semantic  Interpretation 


In this chapter we describe the developments in semantics during the course of the project.
A major change from our earlier work is that interpretation in DELPHI has shifted from
a Montague Grammar [65] style rule-for-rule approach to one that carries out semantic
interpretation directly during the course of the parse, rather than as a post-process on a
finished parse tree [24, 25]. Meaning representations are thereby constructed, and semantic
filtering constraints applied, as part of parsing the utterance.

This  move  has  several  desirable  attributes: 


• More information is available as input to semantic interpretation, so it is possible to
gain higher coverage.

• Syntax and semantics are integrated, so semantic filtering constraints can be applied
as constituents are built and attached, which results in more efficient processing.

• This integration is simple and does not require any complex engineering of
cooperating software modules.


All  three  are  important  for  a  spoken  language  system. 

Our chief aim has been to make this integration of semantic interpretation with the
parsing process as efficient, simple, and generalizable as possible. This goal has led us to
restrict the use of unification to just those situations in which it is the most appropriate
tool, and to employ constraint relations, especially compiled constraint relations, discussed
above in Section 2.2.1, in many places. As a case study of the evolution of DELPHI’s
syntax-semantics interface from a fairly vanilla “unification semantics” implementation to
the current state of affairs, summarized in Section 3.5, the bulk of the rest of the chapter
is devoted to the evolution of the treatment of verbal subcategorization phenomena during
the course of the project. We conclude this discussion with an overview of our general use
of labelled arguments, introduced above in Section 2.4, as an efficient means of interfacing
syntax with semantics in areas other than subcategorization. The earlier sections of this
discussion necessarily refer to aspects of DELPHI which are no longer current. Those who
wish to omit the historical background and justification for our current approach can skip
directly to Section 3.5.

We have also included a section (3.6) that discusses semantic problems which were
treated in our earlier, unification-based approach to semantics. We have included this
historical material, which describes aspects of the system which have been superseded, for
two reasons. First, without the necessary background explaining the problems with the
most straightforward strategies for integrating syntax and semantics, the reasons for our
move to the novel architecture of Section 3.5 might not be apparent. Second, some of
the solutions presented in the historical sections, such as the Q-TERM and NOM-SEM
structures, which we have discarded, may nevertheless be interesting as representations for
such semantic information. Moreover, many of the problems described in these historical
sections must be solved in any system for semantics, and so are useful as a reference.


3.1  Virtual  Rules  and  Subcategorization 


As a prelude to our discussion of the treatment of subcategorization in DELPHI, we begin
with an overview of some of the standard treatments of subcategorization. In
complex-feature based grammar formalisms such as DELPHI’s, there is no formally separate lexicon;
lexical items are introduced by phrase structure rules just as syntactic categories
(“non-terminals”) are. For example, in order for the word “show” to appear in the version of
DELPHI’s grammar used at the start of this project, there would need to be a rule of the
following sort:


(V (NO-CONTRACT) (DITRANSITIVE (TAKES-ACTIVE)) (AGR :P :N) (BASE))
→
(show)


This rule states that “show” is the base form ((BASE)), unspecified for
person and number agreement (:P :N), that it takes a ditransitive complement structure
((DITRANSITIVE)), and may appear only in active constructions ((TAKES-ACTIVE)).
Note that this rule introduces “show” in only one of its uses, the ditransitive (as in “Show
me the flights from Boston to Denver”). There need to be analogous rules for its other
uses, as well, such as simple transitive (as in “Show the flights from Boston to Denver”),
and Direct Object plus Prepositional Phrase (as in “Show the flights from Boston to Denver
to me.”)

For efficiency purposes, DELPHI does not store rules introducing lexical items, but
rather generates them as needed by the parser on the basis of information stored in the
lexicon and in conjunction with a morphology program that handles the regularly inflected
forms*; such rules that are created on demand but not permanently stored are often referred
to as “virtual rules”. Thus, while the lexicon has no formal place in our system, it is used
as a repository of lexical information (subcategorization, semantics, morphology, etc.) that
is used to construct the virtual rules that the grammar actually uses.
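
The idea of generating lexical rules on demand can be sketched as follows. This is a toy illustration, not DELPHI's lexicon or morphology program: the `LEXICON` contents, the suffix table, the feature names `PAST` and `3SG`, and the rule tuple layout are all assumptions made for the example.

```python
# Hypothetical lexicon: one entry per base form, listing its subcategorization frames.
LEXICON = {
    "show": {"category": "V", "frames": ["DITRANSITIVE", "TRANSITIVE"]},
    "fly": {"category": "V", "frames": ["INTRANSITIVE"]},
}

# Toy morphology for regular inflections; the feature names are invented.
SUFFIXES = [("ed", "PAST"), ("s", "3SG")]


def analyze(word):
    """Return (base form, inflection feature), or None for an unknown word."""
    if word in LEXICON:
        return word, "BASE"
    for suffix, feature in SUFFIXES:
        base = word[: -len(suffix)]
        if word.endswith(suffix) and base in LEXICON:
            return base, feature
    return None


def virtual_rules(word):
    """Build the lexical rules for a word only when the parser asks for them."""
    analysis = analyze(word)
    if analysis is None:
        return []
    base, feature = analysis
    entry = LEXICON[base]
    return [(entry["category"], frame, feature, word) for frame in entry["frames"]]


print(virtual_rules("shows"))
```

Nothing is stored per inflected form: one lexical entry for “show” yields a rule for each of its frames whenever “show”, “shows”, or “showed” is encountered.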


3.1.1  Subcategorization 

Virtual rules are also used to ensure that a member of a lexical category, such as V (Verb),
appears with the correct complements. Complements are so called, in traditional grammar,
because they “complete the meaning” of a lexical item in some way. For example, a
transitive verb requires a noun phrase to follow it: “Show the flights” is grammatical
but “*Show” (* indicates ungrammaticality) is not. Complements are lexically specified
in that a given lexical item may or may not require (or permit) a particular category.
Thus, non-transitive verbs forbid a following noun phrase but may optionally permit other
complements; e.g. “*Delta flies Boston” is ungrammatical, but “Delta flies to Boston” is
fine.

The set of complements that a lexical item requires is often referred to as a
subcategorization frame. The version of the DELPHI grammar used at the beginning of this project
contained a rule for every subcategorization frame in which a verb might appear. Each
such rule is “indexed”, as it were, by a mnemonically named feature that must appear as
the value of the subcategorization feature of any verb that can occur in that frame. (In this,
it followed GPSG [37], which uses a similar indexing scheme.) The two rules in Figure
3.1 illustrate this.

The first rule states that a VP may consist of a V (verb) followed by no other
complements if the V bears the feature (INTRANSITIVE). The second rule says that a VP may
consist of a V followed by an NP just in case the verb is specified as being (TRANSITIVE)
and is in the active voice ((TAKES-ACTIVE)). Clearly, such a treatment of
subcategorization has the following properties:

*Irregularly inflected forms are listed in the lexical entry of their base form.

(VP ...)
→
(V ... (INTRANSITIVE ...) ...)

(VP ...)
→
(V ... (TRANSITIVE ... (TAKES-ACTIVE)) ...)
(NP ...)

Figure 3.1: Examples of “Indexed” Subcategorization Rules

1. It requires the existence of a separate VP rule for every subcategorization frame.

2. In cases in which a verb realizes the same semantic argument with different syntactic
forms (e.g. “show” can realize the entity to which an object is shown as either an
NP indirect object or a “to” PP), a separate subcategorization frame for each separate
realization must appear in its lexical entry.

3. In cases where a verb takes a particular complement optionally, the lexical entry
for that verb must specify separate subcategorization frames for its presence and its
absence. For example, in the examples above “show” needed to be specified as
TRANSITIVE for its use without an indirect object and DITRANSITIVE for its use
with an NP indirect object.

4. In cases where the order of complements is free (as in “fly from Boston to Denver”,
“fly to Denver from Boston”), we require either that individual lexical items specify
the different orderings or that the grammar automatically provide multiple rules, when
the freedom of ordering is predictable. For example, if it is the case that all verbs that
select two prepositional phrases permit them to appear in either order, our grammar
might then contain two versions of the rule introducing double PP complements, as
a sort of “lexical redundancy rule”:


(VP ...)
→
(V ... (DITRANSPREP :PREP :PREP1 ...) ...)
(PP :PREP ...)
(PP :PREP1 ...)

(VP ...)
→
(V ... (DITRANSPREP :PREP :PREP1 ...) ...)
(PP :PREP1 ...)
(PP :PREP ...)

The first property entails that the number of subcategorization rules will be large
if the number of actually occurring subcategorizations is large. In the former version of
the DELPHI grammar, there were on the order of 45 subcategorization frames. Since our
grammar formalism does not include meta-rules (a mechanism for deriving related versions
of rules), there were over 80 actual subcategorization rules, to accommodate alternative
forms, such as passive.

Note moreover that this indexing approach to subcategorization requires that the
grammar contain a separate rule even for subcategorization frames that are associated with a
single lexical item. For example, in English, only the verb “bet” seems to take two NP
complements followed by a clause (as in “We bet him $5 that he could not think of such
a verb.”) As a dual of this, if we encounter a subcategorization pattern we have not seen
before, we must introduce a new subcategorization rule, otherwise we cannot handle it.

As the interactions between the optionality of a verb’s arguments and their freedom
of ordering become complex, the number of separate subcategorizations it is required to
specify can become unwieldy and hard to maintain, since changes to what is essentially
the same meaning must propagate to all the subcategorization frames.
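
A short calculation makes the growth concrete. Assuming, purely for illustration, that a verb has n mutually independent optional complements that may appear in any order, an enumerative treatment needs one frame per ordered selection, that is, the sum over k from 0 to n of n!/(n-k)! frames. The sketch below enumerates them; the function name `frames` and the complement names are invented for the example.

```python
from itertools import permutations


def frames(optional_complements):
    """Every ordered selection of the optional complements, including none."""
    result = []
    for k in range(len(optional_complements) + 1):
        result.extend(permutations(optional_complements, k))
    return result


pps = ["from-PP", "to-PP", "time-PP"]
print(len(frames(pps)))  # 1 + 3 + 6 + 6 = 16
```

With a fourth optional complement the count rises to 65, which is why a change to the verb's meaning becomes hard to propagate consistently across all of its frames.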

Finally, this approach to subcategorization assumes a rigid distinction between
complements (which are lexically specified) and adjunct modifiers (which are usually treated as
free). For example, in general English “fly” is considered a verb of motion, which requires
a destination complement in the form of a “to” PP, takes an optional source complement
in the form of a “from” PP, and, like all active verbs, permits optional time modifiers
such as “at 5 PM”. However, in a travel domain such as ATIS, which involves scheduled
travel times, such “free” adjuncts occur much more frequently than in standard English and,
moreover, are given a domain particular interpretation (e.g. the scheduled departure time
of a flight, not the mere accidental time at which that flight happened to be flying). Hence,
there seems to be more structure and regularity to the interpretation of adjunct modifiers
than the strict complement-adjunct distinction allows.

Because of these facts, we have moved to a labelled argument approach, which provides
greater flexibility. The rest of this chapter uses the development of our treatment of
subcategorization as an example of how we integrated semantic processing with syntactic
processing in the most efficient manner.


3.2 A Unification Semantics Treatment of Subcategorization


Our first attempt at merging semantic interpretation with parsing was to carry out semantic
interpretation directly in the unification grammar rules themselves. This was accomplished
by adding semantic features to the grammar rules, placing them on the same footing as the
existing syntactic features. First, consider the following grammar rule, an example of the
“indexing” approach to subcategorization already discussed.


(VP (AGR :P :N) :MOOD (WH-) :TRX :TRY)
→
(V :CONTRACT (TRANSITIVE) :P :N :MOOD)
(NP :NSUBCATFRAME (WH-) :TRX :TRY)


This rule states that a VP may consist of a transitive verb ((V ... (TRANSITIVE) ...))
followed by an NP.

Next, we demonstrate how we added semantic features to this VP rule. The new
features are the variables :SUBJ, :WFF, and :OBJ:


(VP (AGR :P :N) :MOOD (WH-) :TRX :TRY :SUBJ :WFF)
→
(V (TRANSITIVE :WFF :SUBJ :OBJ) :P :N :MOOD)
(NP :NSUBCATFRAME (WH-) :TRX :TRY :OBJ)


This rule passes up a formula as the semantics of the VP, indicated by the variable
:WFF. The semantics of the subject of the clause, indicated by the variable :SUBJ, is
passed down to the verb, as is the semantics of the object NP, indicated by the variable
:OBJ.


For the transitive verb “show”, we have the following lexical rule:


(V (TRANSITIVE (SHOW' :SUBJ :OBJ) :SUBJ :OBJ) :P :N :MOOD)
→
(show)
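
How the VP rule and this lexical rule interact can be sketched with a minimal term unifier. This is an illustration under stated assumptions rather than DELPHI's unifier: terms are encoded as Python tuples, variables are strings beginning with a colon, the lexical rule's variables are renamed to :S and :O to keep them apart from the rule's, the constants USER and FLIGHTS stand in for real NP translations, and no occurs check is performed.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith(":")


def walk(term, subst):
    """Follow variable bindings until reaching a non-variable or an unbound variable."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(a, b, subst):
    """Return an extended substitution, or None if the terms do not unify."""
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None


def resolve(term, subst):
    """Replace every bound variable in a term by its value."""
    term = walk(term, subst)
    if isinstance(term, tuple):
        return tuple(resolve(t, subst) for t in term)
    return term


# Subcategorization term from the VP rule, with subject and object already bound.
from_rule = ("TRANSITIVE", ":WFF", "USER", "FLIGHTS")
# Subcategorization term from the lexical rule for "show" (variables renamed).
from_lexicon = ("TRANSITIVE", ("SHOW'", ":S", ":O"), ":S", ":O")

subst = unify(from_rule, from_lexicon, {})
print(resolve(":WFF", subst))
```

Unifying the two subcategorization terms binds the verb's argument places to the subject and object translations and, in the same step, delivers the instantiated formula through :WFF.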


We can think of this rule in functional terms as taking semantic arguments :SUBJ and
:OBJ and returning as a value the wff (well-formed formula) (SHOW' :SUBJ :OBJ).
Note the placement of semantic arguments to the verb inside the subcategorization term
(headed by the functor TRANSITIVE) instead of at the top level of V. This means that a
verb with a differing number of arguments, such as “give”, has a different subcategorization
functor with a corresponding number of argument places for the semantic translations of
these arguments.


3.2.1 The Use of Constraint Relations for PP Interpretation


The facility for case analysis provided by constraint relations for syntax (Section 2.2) is
also used in semantic interpretation, where the meaning representations of constructions are
computed in terms of values of certain features which cannot always be known in advance.
As a good example of the utility of constraint relations for semantic interpretation, let us
consider the interpretation of Prepositional Phrases. Prepositional Phrases in both
post-copular and other positions within VP are very common in most domains, including ATIS.
Some examples include:


Which  flights  are  on  Delta  Airlines 
Which  flights  are  on  Thursday 
Which  flights  are  after  4  pm 


The  grammar  rule  generating  a  post-copular  PP  in  our  first  introduction  of  semantic 
features  was: 


(VP :SUBJ :WFF)
→
(V (BE))
(PP :PP)
{PREDICATIVE-PP :PP :SUBJ :WFF}


PREDICATIVE-PP is the constraint relation in the rule. It is responsible for specifying
the formula meaning of the VP in terms of the translation of the PP (:PP) and the translation
of the subject passed down from the clause (:SUBJ).

The PREDICATIVE-PP solution for the “flight-on-airline” sense is as follows:


(PREDICATIVE-PP (PP-SEM (:OR (ON) (ABOARD) (ONBOARD)) (:NP AIRLINE))
                (:SUBJ FLIGHT)
                (EQUAL (FLIGHT-AIRLINE-OF :SUBJ) :NP))
→ ∅


The first occurrences of the variables :NP and :SUBJ above are paired with semantic
types AIRLINE and FLIGHT; this is shorthand for the actual state of affairs in which a
term representing a package of information (including an encoding of semantic type) appears
in the slots of the rule these variables occur in:


(Q-TERM (QUANTIFIER) (VARIABLE)
        (NOM (PARAMETER) (SET) (SORT)))


This structure is so constructed as to not unify with another such structure if its semantic
type is disjoint, using a method for encoding semantic types as terms described in Section
3.6.2. For full descriptions of the Q-TERM, NOM, and PP-SEM functors, which are not
used in our current work, see Section 3.6.1.

The PP translation is also a package with the functor PP-SEM, containing the preposition
and the translation of the NP object of the PP. No local attempt is made to translate the
preposition.

When parsing with the above predicate PP rule, the system searches through a database
of PREDICATIVE-PP solutions like the above, much as a PROLOG-based system would.
If a solution successfully unifies, the formula is passed up as the translation of the VP.
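
The database search can be sketched as follows. This is a simplified illustration, not the system's mechanism: semantic types are compared as plain strings rather than by unifying type-encoding terms, the function name `predicative_pp` and the tuple layouts are assumptions, and only the first solution mirrors the one given in the text. The second entry, with its attribute FLIGHT-DAY-OF, is invented to show a second sense of “on”.

```python
# Each solution: (prepositions, type of PP object, type of subject, formula builder).
# Only the first mirrors the solution shown in the text; the second is invented.
SOLUTIONS = [
    ({"ON", "ABOARD", "ONBOARD"}, "AIRLINE", "FLIGHT",
     lambda subj, np: ("EQUAL", ("FLIGHT-AIRLINE-OF", subj), np)),
    ({"ON"}, "DAY", "FLIGHT",
     lambda subj, np: ("EQUAL", ("FLIGHT-DAY-OF", subj), np)),
]


def predicative_pp(pp, subj):
    """Return the formula of the first solution that matches, or None."""
    prep, (np_type, np) = pp
    subj_type, subj_term = subj
    for preps, want_np, want_subj, build in SOLUTIONS:
        if prep in preps and np_type == want_np and subj_type == want_subj:
            return build(subj_term, np)
    return None


flight = ("FLIGHT", "x")
print(predicative_pp(("ON", ("AIRLINE", "DELTA")), flight))
print(predicative_pp(("ON", ("DAY", "THURSDAY")), flight))
```

The preposition itself is never translated in isolation; its contribution is fixed only once the types of the PP object and the subject select a solution.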

Constraint relations are used not only to impose a stipulation on the constituents of a
rule but also to allow for multiple ways to satisfy these constituents. For example, the
PP “on American Airlines” can apply differently to different NPs: to an NP headed by
“flight”, in which case it indicates that the flight is an American Airlines flight, or to an NP
headed by “fare”, in which case it indicates that the fare is the fare of an American Airlines
flight.

So far we have not indicated how the system would distinguish between these two
cases: in other words, how it would tell a fare and a flight apart. The variables :SUBJ
and :OBJ in the previously presented lexical rule for “show” are typed to range over the
Q-TERM structures that represent noun phrase semantics, which we introduced earlier.

As an example, we give a second version of the rule for “show” in Figure 3.2, this time
incorporating the selectional restriction that a person shows a flight, one of the possible
transitive senses of “show”. The use of the numbers “1” and “2” in Figure 3.2 is intended to
indicate the multiple occurrences of the complex forms they label. (Note that this is simply
the Common Lisp [94] convention for re-entrant list structure in the rule. This is
at present only used for notational compactness; the system does not currently attempt to
take computational advantage of re-entrancy during unification or other processing.)

(V (TRANSITIVE
     (SHOW' #1=(Q-TERM :Q1 :VAR1 (NOM :PARS1 :SET1 (PERSON)))
            #2=(Q-TERM :Q2 :VAR2 (NOM :PARS2 :SET2 (INANIMATE (FLIGHT)))))
     #1#
     #2#)
   :P :N :MOOD)
→
(show)

Figure 3.2: Unification Semantics for “show”

3.3 The Treatment of Optional Complements


While the introduction of semantic features into our standard “indexing” treatment of
subcategorization provided a basic semantics, it did not solve any of the problems relating
to optionality of complements, freedom of ordering, and alternative realizations, which
we discussed in Section 3.1.1. We now turn to a further development of our treatment
of subcategorization: the introduction of mechanisms to handle optional complements in a
more efficient way than simply enumerating subcategorization features in individual lexical
items. Recall our initial treatment of the semantics of subcategorization as exemplified in
the following rule:


(VP :SUBJ :WFF)
→
(V (TRANSITIVE :WFF :SUBJ :OBJ))
(NP :OBJ)


The semantics of the subject of the sentence is passed down through the :SUBJ variable
to the V, along with the semantics of the complements. The V in turn passes back up the
formula representing the semantics of the whole sentence, through the :WFF variable.

As pointed out above in Section 3.1.1, this mechanism requires one VP rule for every
subcategorization frame one wants to handle, and this can require many rules. A more
serious problem, however, arises in the case of optional complements to verbs, as seen in
the following actual ATIS training sentences that use the verb “arrive”:


Show me all flights from Boston to Denver that arrive before 5 PM
Show me flights ... that arrive in Baltimore before noon
Show me all the nonstop flights arriving from Dallas to Boston by 10 PM ...
Show me flights departing Atlanta arriving San Francisco by 5 PM
Show me flights arriving into Atlanta by 10 PM from Dallas


We see here that, in addition to the temporal PP that always accompanies it, “arrive” can
be followed by (1) nothing else, (2) a PP with “in”, (3) a “from”-PP and a “to”-PP, (4) a
bare noun phrase, or (5) an “into”-PP and a “from”-PP. The principal pattern that emerges
is one of complete optionality and independence of order. Indeed, in the fifth example,
the temporal PP, which might be more traditionally regarded as an adjunct rather than a
complement, and thus as one of the siblings of the VP rather than one of its constituents,
is instead interposed between two PP complements, making the adjunct analysis rather
problematic.²

The only way the subcategorization scheme presented above could deal with these
variations would be to enumerate them all in separate rules. But this would clearly be
infeasible. The initial solution to this problem that we adopted constructed a right-branching
tree of verbal complements, where the particular constituents admitted to this tree are
controlled by constraint relations keying off the lexical head of the verb. There were two
main rules, which are shown in Figures 3.3 and 3.4. The rule in Figure 3.3 generates any
number of optional PP complements to a verb that otherwise takes no other complements,
while that in Figure 3.4 generates any number of optional PP complements to a verb that
takes an initial NP complement.


(VP :SUBJ (AND :INITIAL-WFF :COMP-WFF))
→
(V :LEX (INTRANSOPTCOMPS :SUBJ :INITIAL-WFF))
(OPTCOMPS :LEX :SUBJ (DUMMY) :COMP-WFF)

Figure  3.3:  Optional  Intransitive  PP  Complements 


(VP :SUBJ (AND :INITIAL-WFF :COMP-WFF))
→
(V :LEX (TRANSITIVEOPTCOMPS :SUBJ :OBJ :INITIAL-WFF))
(NP :OBJ)
(OPTCOMPS :LEX :SUBJ :OBJ :COMP-WFF)


Figure 3.4: Optional Transitive PP Complements

²To see that these PPs are truly associated with the verb rather than somehow modifying the subject
flight-NP, one need only replace the subject with the pronoun “it”.

The category OPTCOMPS generates a right-branching tree of optional PP complements.
Consistent with the use of such dummy categories to introduce optional elements,
as discussed in Section 2.3, OPTCOMPS is expanded by two rules.


(OPTCOMPS :LEX :SUBJ :OBJ (AND :WFF1 :WFF2))
→
(PP :PP)
(OPTCOMPS :LEX :SUBJ :OBJ :WFF1)
{OPTCOMP-PP :LEX :SUBJ :OBJ :PP :WFF2}


introduces  the  optional  PP  complements,  while 


(OPTCOMPS :LEX :SUBJ :OBJ (TRUE)) → ∅


expands to nothing, allowing the optionality.

OPTCOMP-PP is the constraint relation; it keys off the lexical head of the verb (provided as the binding of the variable :LEX) and combines the subject, object, and PP complement translations to produce the contribution of the PP complement to the final formula that represents the sentence meaning. An arbitrary number of PP complements are provided for by the recursion of the first rule above, which bottoms out in the case of the second rule when there are no more complements. Phrasal types other than PPs are accommodated by similar rules.
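
The recursion can be sketched in Python roughly as follows. This is only an illustration under assumed names: expand_optcomps, optcomp_pp, and the tuple encoding of formulas are hypothetical and are not part of DELPHI.

```python
# Hypothetical sketch of the two OPTCOMPS rules; not DELPHI code.

def optcomp_pp(lex, subj, obj, pp):
    """Stand-in for the OPTCOMP-PP constraint relation: returns the
    contribution of one PP complement as a tuple-encoded formula."""
    prep, np = pp
    return (lex, prep, subj, obj, np)

def expand_optcomps(lex, subj, obj, pps):
    """Right-branching expansion of a list of PP complements."""
    if not pps:                      # second rule: expand to nothing
        return ("TRUE",)
    first, rest = pps[0], pps[1:]    # first rule: one PP, then recurse
    wff1 = expand_optcomps(lex, subj, obj, rest)
    wff2 = optcomp_pp(lex, subj, obj, first)
    return ("AND", wff1, wff2)
```

With an empty complement list the sketch returns the trivial formula, mirroring the way the second rule terminates the recursion.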

The  solution  for  PP  complements  to  “arrive”  such  as  “in  Atlanta”,  “into  Baltimore”  “at 
Denver”  etc.  follows: 


(OPTCOMP-PP (ARRIVE)
            (:SUBJ type FLIGHT)
            (:NP type CITY)
            (:OR (INTO) (IN) (AT))
            (EQUAL (DESTINATION-CITY :SUBJ) :NP))


This rule says that for a flight to “arrive” INTO, IN or AT a city means that the city equals the value of the flight’s DESTINATION-CITY attribute. Semantic type information is here notated with a shorthand keyword type; in the actual system a partially-specified term that packages semantic type information in a specific slot was unified into such


BBN Systems and Technologies


BBN  Report  No.  7715 




variables as :SUBJ, :OBJ, and :NP. Note also the use of disjunction (via :OR) to combine different prepositions together.
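
As a rough illustration of what the constraint for “arrive” computes, the following Python sketch checks the semantic types and the preposition and, if they match, returns the equality described in the text. The function name, argument layout, and tuple encoding are assumptions made for this example only.

```python
# Hypothetical sketch of the OPTCOMP-PP constraint for "arrive".
ARRIVE_PREPS = {"INTO", "IN", "AT"}   # the disjunction of prepositions

def optcomp_pp_arrive(subj, subj_type, prep, np, np_type):
    """Return the PP's contribution to the sentence formula, or None
    if the type or preposition constraints are not met."""
    if subj_type == "FLIGHT" and np_type == "CITY" and prep in ARRIVE_PREPS:
        return ("EQUAL", ("DESTINATION-CITY", subj), np)
    return None
```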

While the OPTCOMPS approach to complementation solved the problem of optionality and free order among complements, it did not provide a solution to the problem of alternate syntactic realizations of the same semantic argument, such as the realization of an indirect object either as an NP or a “to” PP, which was discussed in Section 3.1.1. In the next section, we present a new mechanism which solves all these problems.


3.4  The  Mapping  Unit  Approach  to  Subcategorization 


In this section, we introduce the “mapping unit” approach to representing subcategorization information. The advantage of this approach over the basic indexing treatment and the OPTCOMPS mechanism lies in its flexibility, a flexibility which in turn offers greater robustness of coverage with respect to unanticipated variations of a verbal argument pattern, and easier extension of coverage to new patterns. It handles in a quite natural way complement order variation, optionality of complements, alternative syntactic realizations of arguments, and metonymy. This is essentially the approach to representing subcategorization information which we currently employ, although the use of generalized relational labels discussed in Section 3.5 has changed the form of the actual rules used and the way in which semantic information is represented and combined.


3.4.1  Previous  Approaches 


Many past approaches have sought to represent subcategorization declaratively, often using an approach based on the unification of feature values. Such approaches as Definite Clause Grammar [74], Categorial Grammar [2], PATR-II [87], and lexicalized TAG [81] include in one form or another a notion of “subcategorization frame” that specifies a sequence of complement phrases and constraints on them. Some have also advocated using the feature system to encode semantic information (as for example [66]), and this, as was seen above, was our own initial approach to subcategorization.

“Mapping unit” subcategorization is partly inspired by these approaches, but it handles several kinds of variation in natural language utterances which cause difficulty for them. These forms of variation are not in any sense marginal phenomena, but are instead repeatedly seen in naturally derived data, such as that for the ATIS SLS common task domain. The phenomena fall into three classes:

The first is variation in argument order, as seen in


fly  from  Denver  to  Boston 
fly  to  Boston  from  Denver 


Such variation can be handled by the frame approach, but only at the cost of specifying one frame for each order. Besides such lexically-specific variation of order, other sources of order variation include interpolation of elements traditionally considered adjuncts (“What flights leave at 3 pm from Denver”) and heaviness effects (“Show on the screen the fares and departure times of all the flights from Boston to Dallas”).

The second is the optionality of arguments, and the different consequences thereof, including zero anaphora:


“What restrictions apply?”
(= apply TO SOMETHING IN CONTEXT)


default value:


“Show the flights.”
(= show the flights TO ME)


existential quantification:


“fly to Boston” (AT WHATEVER TIME)


and independent truth-conditions:


“The Rockettes kicked.”

Each verb that has optional arguments tends to have different preferences for what to do with the omitted argument places, as the above examples make clear. The frame approach can handle them, but again only at the cost of specifying multiple frames.

The third and final type of variation is the metonymic coercion of arguments, as seen below:

“What wide-body jets serve dinner?”

(= “What FLIGHTS on wide-body jets serve dinner?”
aircraft themselves do not “serve meals”)

“What airlines fly to Dallas?”

(= “What airlines HAVE FLIGHTS to Dallas?”
airlines themselves don’t “fly”)


In both examples there is a superficial clash of types which is meant to be reconciled through the interposition of an implicit binary relation between the objects having those types. Our work postulates a distinction between two kinds of metonymy: “referential”, where the argument is taken to be an indirect reference to an object of the proper type, and “predicative”, where only the argument slot of the predicate is coerced and the referent is taken literally. This distinction will be discussed in more detail below.

Most verbs in the ATIS corpora (“fly”, “arrive”, etc.) have flight, source, destination, time of day, and day of the week arguments, most of which are not obligatory and can occur in almost any order. The number of frames necessary is combinatorially impractical, and to this situation the phenomenon of metonymic coercion, which makes the variation potentially open-ended, only provides the final blow. A fundamentally different framework from that of subcategorization frames is needed.
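
To give a sense of the scale involved, the following Python sketch counts the frames a frame-based lexicon would need for a single verb. The counting model is an assumption made for illustration: n optional arguments, each occurring at most once, in any order, with metonymy ignored.

```python
from math import perm

def frame_count(n_optional):
    """Number of distinct ordered frames over n optional, freely ordered
    arguments: for each k, choose and order k of the n arguments."""
    return sum(perm(n_optional, k) for k in range(n_optional + 1))

# Four optional post-verbal arguments (source, destination, time of day,
# day of week): 1 + 4 + 12 + 24 + 24 = 65 frames for one verb.
print(frame_count(4))
```

Under the same assumptions a fifth optional argument raises the count to 326, which is why enumerating frames does not scale.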


3.4.2  The  “Mapping  Unit”  Information  Structures 


The central idea of the mapping unit approach is that there are several different types of subcategorization constraints, which ought to be represented as separate constraints, rather than enumerating the “Cartesian product” of their possible combinations in fixed patterns.

The basic building block is the “mapping unit”, a structure which represents the constraints on a particular phrasal argument and the contribution this argument makes to the semantics of the clause. Mapping units do not “know” whether they are optional or not, or in what order they occur with respect to other mapping units; this information is instead represented in the grammar and in a larger structure called a “map”, of which the mapping unit is a component.

Figure 3.5 gives an example of a mapping unit. Each mapping unit has the four components shown here: a grammatical relation (SUBJECT, DIRECT-OBJECT, OTHER-PP, etc.), a syntactic pattern, a type requirement, and semantic role information. The syntactic pattern is a unification pattern, and thus retains all the advantages of being able to handle partial information that are associated with unification. The syntactic pattern (in the example just NP) also includes slots for semantic translation (:trans).

SUBJECT
(NP :trans)
(FLIGHT :trans)
(= FLIGHT-OF :trans)


Figure  3.5:  An  Example  Mapping  Unit 

The semantic type requirement (here FLIGHT) is also enforced by unification, but separately, so that a failure due to semantic type clash can be distinguished from one that violates syntactic constraints. In this way, if the type requirement is not met, the mapping unit can be metonymically coerced, filling the semantic role not with the original complement translation but with an indefinite object related to it via a binary relation that resolves the clash. For example, the unit above could be coerced to accept an object of type AIRLINE via the binary relation FLIGHT-AIRLINE-OF, which maps flights to their airlines, thus handling the utterance “What airlines fly to Dallas”.
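
The separation of the type check from the syntactic match can be sketched in Python as follows. This is a minimal illustration, not the system's unification machinery: the function fill_role, the COERCIONS table, and the placeholder variable name are all assumptions.

```python
# Hypothetical sketch; the relation table and names are illustrative.
COERCIONS = {("AIRLINE", "FLIGHT"): "FLIGHT-AIRLINE-OF"}

def fill_role(required_type, filler, filler_type):
    """Return (role_value, extra_constraints), or None on failure."""
    if filler_type == required_type:
        return filler, []                 # ordinary, uncoerced filling
    relation = COERCIONS.get((filler_type, required_type))
    if relation is None:
        return None                       # irreconcilable type clash
    x = "x"                               # indefinite object of the required type
    return x, [(relation, x, filler)]     # relate it to the literal argument
```

Because the type clash is detected on its own, the caller can tell a coercible mismatch (AIRLINE where FLIGHT is required) from one with no resolving relation.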

A final separate representation is the contribution the semantics of the argument makes to the semantics of the clause, which is indicated by a semantic role constraint. The role is set (with the equality symbol =) in the case of ordinary complement arguments and restricted (with <) in the case of temporal or locative adjunct modifiers, which are not restricted to any pre-specified number of occurrences; e.g. “in Harvard Square at Out of Town News next to the foreign magazine section”. A semantic role can only be set once in any given clause (which of course does not exclude it from being set to a conjunctive element), but can be restricted arbitrarily many times.
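
The difference between setting and restricting a role can be pictured with a small Python sketch. The class and method names are hypothetical and the role name LOCATION is used only for illustration.

```python
class RoleBindings:
    """Tracks the semantic roles of one clause (illustrative only)."""

    def __init__(self):
        self.set_roles = {}       # role -> filler, at most one each
        self.restrictions = {}    # role -> list of restricting fillers

    def set(self, role, filler):
        if role in self.set_roles:
            return False          # a role can be set only once
        self.set_roles[role] = filler
        return True

    def restrict(self, role, filler):
        # Restriction may apply any number of times to the same role.
        self.restrictions.setdefault(role, []).append(filler)
        return True
```

A second attempt to set DEST-CITY fails, while stacked locatives such as “in Harvard Square at Out of Town News” simply accumulate as restrictions on one role.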

The mapping units are combined in a larger structure called a “map”. Figure 3.6 presents a (much reduced) example for the verb “fly”, in the ATIS domain. Every map has four components:


1. a labeled-argument predicate with typed roles

2. a collection of “mapping units”

3. a completion condition

4. a translation rule for the labeled-argument predicate


The labeled-argument predicate (FLY1 in the example) is the representation of the verb’s “meaning”, and has an assigned set of typed semantic roles which can appear in any application of the predicate (but which are not necessarily required to appear in every application).

((FLY1 FLIGHT-OF FLIGHT
       ORIG-CITY CITY
       DEST-CITY CITY)

 SUBJECT
 (NP :trans)
 (FLIGHT :trans)
 (= FLIGHT-OF :trans)

 OTHER-PP
 (PP (FROM) (NP :trans))
 (CITY :trans)
 (= ORIG-CITY :trans)

 OTHER-PP
 (PP (TO) (NP :trans))
 (CITY :trans)
 (= DEST-CITY :trans)

 completion (AND (FILLED FLIGHT-OF DEST-CITY)
                 (FILLED-OR-ANAPHOR ORIG-CITY))
 translation (p-and (flight-dest FLIGHT-OF DEST-CITY)
                    (flight-orig FLIGHT-OF ORIG-CITY)
                    (flight-departure-time FLIGHT-OF TIME-OF-DAY)))


Figure 3.6: An Example Map

The completion condition must be satisfied by any complete clause with the verb as head, and includes requirements on the instantiation of semantic roles.^ In the example map, the FILLED completion predicate requires that the roles FLIGHT-OF and DEST-CITY be filled by a literal argument to the verb, while the FILLED-OR-ANAPHOR completion predicate allows the role ORIG-CITY to be implicitly filled by a discourse entity. This condition allows “What flights fly to Denver from Boston?”, “What flights fly from Boston to Denver?” and “What flights fly to Denver?” but forbids “What flights fly?”.
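
A minimal Python sketch of this completion check follows. It assumes that bindings is a dictionary from roles to literal fillers and that discourse is a dictionary of roles the context is able to supply; both encodings, and the function names, are assumptions for illustration.

```python
def filled(bindings, *roles):
    """FILLED: every role has a literal argument."""
    return all(r in bindings for r in roles)

def filled_or_anaphor(bindings, discourse, *roles):
    """FILLED-OR-ANAPHOR: a literal argument or a discourse entity."""
    return all(r in bindings or r in discourse for r in roles)

def fly1_complete(bindings, discourse):
    """Completion condition of the example map for FLY1."""
    return (filled(bindings, "FLIGHT-OF", "DEST-CITY")
            and filled_or_anaphor(bindings, discourse, "ORIG-CITY"))
```

With a discourse context that can supply an origin, “What flights fly to Denver?” passes, while “What flights fly?” fails because DEST-CITY has no literal filler.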

Other completion predicates include FILLED-OR-DEFAULT, which specifies a default value for a role, FILLED-OR-EXISTS, which generates an existential quantification over the range type if the role is unfilled, and GRAMMAR-REL-FILLED, which requires that a particular grammatical relation have been assigned. The unqualified optionality of a semantic role is indicated simply by leaving it out of the completion conditions.^

^This requirement, in effect, implements the Functional Completeness Condition of LFG [28].

The  fourth  map  component,  the  translation  rule,  converts  labeled-argument  predicate 
applications  into  ordinary  logic  expressions  based  on  the  roles  they  instantiate.  Its  separa¬ 
tion  from  the  rest  of  the  map  avoids  duplicate  specification  of  the  details  of  logical  form 
construction. 

This last point is important because a map can have multiple units on the same semantic role to represent multiple syntactic realizations of it. For example, the utterance “fly DENVER to Boston” (meaning “fly FROM DENVER to Boston”) can be accommodated simply by adding the following unit to the map above:


OTHER-NP
(NP :trans)
(CITY :trans)
(= ORIG-CITY :trans)


and  the  function  of  the  translation  rule  is  exactly  the  same. 


Any semantic role can be filled only once, so that overgeneration from multiple mapping units assigning the same role is prevented.^ For example, given the mapping unit for OTHER-NP just presented and the original map for FLY1 already presented, “fly Boston from Denver” would be ruled out since, given the existing mapping units, both “Boston” and “from Denver” would be attempting to fill the ORIG-CITY semantic role. Note that “fly Denver from Denver” is ruled out for exactly the same reason, even though, pre-theoretically, the “same” entity, Denver, is being used to fill the same semantic role twice. Note, then, that the constraint against multiple fillers for a single semantic role is a constraint on the mapping from syntactic elements to semantic roles, and not directly on that from semantic entities to semantic roles.


Certain grammatical relations can also be assigned only once in the derivation of any clause. These are the “major” grammatical relations: SUBJECT, DIRECT-OBJECT, INDIRECT-OBJECT. The OTHER-(cat) relations can be assigned arbitrarily many times, subject only to the constraint that semantic roles be filled only once. Multiple mapping units that assign the same major grammatical relation are also allowed, subject only to the above constraint and of course to the constraint on the unique assignment of semantic roles. This is useful for handling certain types of polysemy, specifically the semantic overloading of syntactic argument positions.

^Note that this assumption of optionality means that, as far as any given lexical item is concerned, any grammatical realization of a semantic role is optional, including SUBJECT. However, as is well known, English, unlike Italian or Greek, requires an overt subject in finite clauses. This requirement is imposed by the grammar, rather than by individual lexical items.

^This requirement, in effect, implements the Functional Consistency Condition of LFG [28]. This requirement and the requirement that completion predicates be satisfied, in effect, implement the Theta Criterion of GB theory [29].

For modifiers which are normally treated as adjuncts, such as temporal or locative modifiers, our framework provides a notion of “free” mapping units associated with distinguished roles (such as TIME-OF-DAY, etc.). Such units do not have to be included in the map for individual lexical items whose labeled-argument predicate translation includes the role. In the example map above, TIME-OF-DAY is such a “free” mapping unit.

Finally,  we  should  point  out  that  while  the  map  information  structures  can  handle  a 
considerable  degree  of  variation,  it  is  not  necessary  for  any  one  map  to  handle  all  the 
possible  variations  associated  with  a  verb.  A  verb  can  have  multiple  maps  in  the  case  of 
conventional  lexical  ambiguity,  just  as  it  can  have  multiple  subcategorization  frames  in 
other  approaches. 


3.4.3  Use of Mapping Units in Grammar and Semantic Interpretation

Standard phrase structure rules augmented with features [74] are in our approach further augmented with the non-constituent predicates AVAILABLE, VP-BIND, and COMPLETE-WFF, along with the selector CONSTIT. The following is an example of the grammar rule that assigns DIRECT-OBJECT, reduced to include only features of interest:


(VP :MAP :BINDINGS2)

→

(VP :MAP :BINDINGS1)
{AVAILABLE DIRECT-OBJECT :MAP :BINDINGS1}
(NP :TRANS)
{VP-BIND DIRECT-OBJECT :MAP {CONSTIT (NP) (1)} :BINDINGS1 :BINDINGS2}


The predicate AVAILABLE takes a grammatical relation, a map, and a bindings list, which is a list pairing mapping units with role fillers. It is satisfied if there is a unit in the map with that grammatical relation such that the semantic role of that unit is not set in the bindings list, and the grammatical relation is not assigned in the bindings list (if it is a “major” grammatical relation).

The predicate VP-BIND takes a grammatical relation, a map, an entire constituent (retrieved by the function CONSTIT as seen above), an input bindings list, and an output bindings list. If it succeeds, it produces a new bindings list containing an additional pair of unit and filler. It will succeed if it finds a free unit in the map that can be matched with the passed-in constituent both syntactically and semantically. VP-BIND will have more than one solution if it finds multiple units with these properties, in which case there are multiple parses. (Note that the AVAILABLE predicate is really only necessary to prevent the parser from looking for a constituent that would only wind up not being attachable to the VP.)
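
The behavior of the two predicates can be approximated by the following Python sketch. It is an assumption-laden simplification: units are flat tuples, syntactic matching is reduced to comparing category labels such as "PP-FROM" (an invented encoding), and metonymic coercion is omitted.

```python
MAJOR = {"SUBJECT", "DIRECT-OBJECT", "INDIRECT-OBJECT"}

# A unit is (grammatical relation, syntactic category, semantic type, role).
FLY1_UNITS = [
    ("SUBJECT", "NP", "FLIGHT", "FLIGHT-OF"),
    ("OTHER-PP", "PP-FROM", "CITY", "ORIG-CITY"),
    ("OTHER-PP", "PP-TO", "CITY", "DEST-CITY"),
    ("OTHER-NP", "NP", "CITY", "ORIG-CITY"),
]

def free_units(relation, units, bindings):
    """Units for `relation` whose role is unset and, for major
    relations, whose relation is not yet assigned."""
    used_roles = {u[3] for u, _ in bindings}
    used_rels = {u[0] for u, _ in bindings}
    if relation in MAJOR and relation in used_rels:
        return []
    return [u for u in units if u[0] == relation and u[3] not in used_roles]

def available(relation, units, bindings):
    return bool(free_units(relation, units, bindings))

def vp_bind(relation, units, constituent, bindings):
    """Yield one extended bindings list per matching free unit."""
    cat, sem_type, trans = constituent
    for u in free_units(relation, units, bindings):
        if u[1] == cat and u[2] == sem_type:
            yield bindings + [(u, trans)]
```

Since vp_bind is a generator, several matching units yield several bindings lists, corresponding to the multiple parses mentioned above. After “from Denver” has set ORIG-CITY, a bare city NP can no longer bind through OTHER-NP.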

The pair of map and bindings effectively constitutes the meaning of the VP, and can be likened to an application of a lambda-expression (the map) to arguments (the bindings). The difference is that while the arguments to a regular lambda-expression can either be bound all at once or in some fixed order (e.g. through currying), the arguments to a map are referred to by label, and can be applied in any order we please.

Currently, the maps only provide optionality information, while the relative order of complements is enforced by the grammar via grammatical relations. This has the advantage that certain ordering constraints need only be stated once, as opposed to over and over again in map entries. An example of a rule imposing ordering constraints is the ditransitive VP rule, which handles “Show me the flights”:


(VP :MAP :BINDINGS2)

→

(VP :MAP :BINDINGS)
{AVAILABLE INDIRECT-OBJECT :MAP :BINDINGS}
{AVAILABLE DIRECT-OBJECT :MAP :BINDINGS}
(NP :TRANS1)
{VP-BIND INDIRECT-OBJECT :MAP {CONSTIT (NP) (1)} :BINDINGS :BINDINGS1}
(NP :TRANS2)
{VP-BIND DIRECT-OBJECT :MAP {CONSTIT (NP) (2)} :BINDINGS1 :BINDINGS2}


The constraint that the subject precedes post-verbal complements is expressed by the clause-level S rule, which assigns the relation SUBJECT:


(ROOT-S (QUESTION) :MAP :BINDINGS2)

→

(NP :TRANS)
(VP :MAP :BINDINGS1)
{VP-BIND SUBJECT :MAP {CONSTIT (NP) (1)} :BINDINGS1 :BINDINGS2}


The completion conditions for the clause are enforced by the rule for the top-most node, START. This rule contains a predicate COMPLETE-WFF that takes a map and a bindings list, and delivers an output formula:


(START (QUERY :WFF))

→

(ROOT-S (QUESTION) :MAP :BINDINGS)
{COMPLETE-WFF :MAP :BINDINGS :WFF}


COMPLETE-WFF enforces the completion conditions of the map and reduces the map and bindings combination to a formula if these conditions are satisfied.

The formula to be generated is specified by the translation rule component of the map. This translation rule can really be regarded as a kind of meaning postulate for the predicate that is associated with it directly. It consists of an ordinary logic expression containing references to the argument labels of the predicate. Repeated below is the translation for the predicate FLY1:


(P-AND (flight-dest FLIGHT-OF DEST-CITY)
       (flight-orig FLIGHT-OF ORIG-CITY)
       (flight-departure-time FLIGHT-OF TIME-OF-DAY))


To generate the formula, the fillers of the argument roles are substituted for these references. The P-AND is a meta-conjunction operator with the property that if any of the role references of one of its conjuncts are unfilled, that conjunct is left out of the final formula. In this way we are not required to generate an existential quantification for a missing argument place (as for example the departure time of the flight).
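
The effect of P-AND can be sketched in Python as follows. The tuple encoding of the translation rule and the names p_and, ROLES, and FLY1_TRANSLATION are assumptions made for this illustration.

```python
ROLES = {"FLIGHT-OF", "ORIG-CITY", "DEST-CITY", "TIME-OF-DAY"}

FLY1_TRANSLATION = [
    ("flight-dest", "FLIGHT-OF", "DEST-CITY"),
    ("flight-orig", "FLIGHT-OF", "ORIG-CITY"),
    ("flight-departure-time", "FLIGHT-OF", "TIME-OF-DAY"),
]

def p_and(conjuncts, fillers):
    """Substitute fillers for role references; drop any conjunct that
    mentions an unfilled role."""
    out = []
    for pred, *args in conjuncts:
        if all(a in fillers for a in args if a in ROLES):
            out.append((pred, *[fillers.get(a, a) for a in args]))
    return ("and", *out)
```

With only FLIGHT-OF and DEST-CITY filled, the origin and departure-time conjuncts are dropped and only the flight-dest conjunct survives.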

There are certain instances in which an existential quantification is generated, however. If a semantic role has merely been restricted instead of filled, a narrow-scope existential quantification is generated and the variable of this quantification substituted for the role reference in the translation rule. Thus for “Flight 1 flies before 3 pm” we would have:


(exists t (precede t (time 3 0 pm))
        (flight-departure-time (flight-no 1) t))


3.4.4  Predicative Metonymy


Another case in which narrow-scope existentials are generated is the case of predicative metonymy, in which the semantic role in question has been type-coerced to accept an argument of a type different from its restriction. In this type of metonymy, the referent of this argument does not change. Instead a relation is established between it and an indefinite, existentially quantified object of the proper type. Thus for “Delta Airlines flies from Boston to Baltimore” we would have:


(exists x flight (airline-of x Delta)
        (and (orig-city x boston)
             (dest-city x baltimore)))


The  distinction  between  referential  and  predicative  metonymy  only  becomes  visible  when 
the  actual  referents  of  NPs  are  sought,  as  in: 


What  airlines  fly  to  Boston? 

What  wide-body  jets  serve  dinner? 


In the first, it is implausible that “airlines” is being used to refer to some set of flights, since every flight is on some airline and there is no constraint. In the second, “wide-body jets” is far more likely to refer to some set of flights, since not every flight is on a wide-body jet.


Predicative metonymy is an essentially local phenomenon, while referential metonymy is an essentially global one. Our present implementation assumes predicative metonymy only and allows only a limited set of binary relations. Processing is such as to prefer attachments that do not require metonymy, by assigning a lower probability at parse time to parses which do require it; see Section 4.2 for discussion of the use of statistical mechanisms to control parse preference. This is necessary to exclude an unreasonable parse of:


Show  flights  to  Denver  on  wide-body  jets  serving  dinner. 


That is, one in which the participial modifier “serving dinner” is attached to the Noun Phrase “wide-body jets”, rather than to the containing Noun Phrase “flights to Denver on wide-body jets”. The lower attachment (to “wide-body jets”) would entail coercing “wide-body jets” to refer to flights, giving us a semantic interpretation involving reference to “flights on flights”. The higher attachment (to “flights”) entails no such coercion and no such interpretive problems.
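
The preference can be pictured with a toy scoring function in Python. The penalty factor, the base probabilities, and the function names are assumptions chosen purely for illustration; the statistical mechanism actually used is the one discussed in Section 4.2.

```python
METONYMY_PENALTY = 0.1   # assumed value, purely illustrative

def parse_score(base_probability, n_coercions):
    """Lower the score once for each metonymic coercion in the parse."""
    return base_probability * METONYMY_PENALTY ** n_coercions

def best_parse(parses):
    """parses: list of (label, base_probability, n_coercions)."""
    return max(parses, key=lambda p: parse_score(p[1], p[2]))[0]
```

Given two otherwise equally likely attachments of “serving dinner”, the one that needs no coercion scores higher and is selected.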


3.4.5  Other  Benefits 


The combination of labeled-argument predicate and translation rule offers several benefits not yet mentioned. One is that a given predicate can be shared between different lexical entries which provide different syntactic realizations of it. For example, in the ATIS domain the verbs “depart” and “originate” have very similar core meanings, yet have semantic roles realized by different prepositions:


The flight departs from Boston.
The flight originates in Boston.
*The flight departs in Boston.


The words are not synonyms in the normal sense that one can be substituted for the other in such a way as to preserve grammaticality. But their common semantic content can be represented.
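
A small Python sketch of two lexical entries sharing one predicate is given here as an illustration. The predicate name DEPART1 and the dictionary layout are invented for this example and do not come from the system's lexicon.

```python
# Hypothetical lexical entries; the predicate name DEPART1 is invented.
LEXICON = {
    "depart":    {"predicate": "DEPART1", "orig_prep": "FROM"},
    "originate": {"predicate": "DEPART1", "orig_prep": "IN"},
}

def interpret(verb, prep, city):
    """Map a verb plus origin PP to a shared predicate application."""
    entry = LEXICON[verb]
    if prep != entry["orig_prep"]:
        return None                        # e.g. *"departs in Boston"
    return (entry["predicate"], ("ORIG-CITY", city))
```

Both grammatical sentences receive the same semantic representation, while the starred one receives none.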

Another advantage is that a denotational semantics with optionality is implemented without requiring Davidsonian-style event quantifications [34]. While event objects make sense in some contexts, having an existential quantification over events for every verb is frequently inconvenient in further processing. Certainly it is so in the ATIS domain, where the chief semantic outcome of clauses seems to be a set of predications on attributes of flight-individuals and there really are no “events” as such at all. Event quantifications are not precluded, however; they could be produced with a different translation rule schema.


3.4.6  Comparison with Other Work

Our treatment of the syntactic aspects of subcategorization is most like that of PATR-II [87]. Both PATR-II and the mapping units approach use recursive VPs, with each level of VP structure, in effect, “peeling off” a single constituent of the head verb’s complement list.^ A major difference between the two approaches is that the PATR-II system of subcategorization is essentially limited to popping constituents off the subcategorization list, in fixed order, requiring a separate subcategorization list for each variation in order. Our approach allows complements to be found in whatever order the grammar will allow them. In this respect, it is more like the UD system [50], which has an operator for “non-deterministic extraction from arbitrary list positions”.^ However, our system does not literally remove units from the map, but rather simply marks them as no longer available for binding. Moreover, the mechanism of allowing multiple syntactic realizations for a given semantic relation in a single representation of complement structure is, to our knowledge, unique.

^Our approach allows more than a single complement constituent to appear at a given level, however, as the ditransitive VP rule above shows.

Related work on the semantic aspects of argument optionality has been reported by Palmer [71], [33]. Our work differs from this mainly in the tighter coupling of syntax and semantics during processing and the use of recursive VP structures, which potentially allows for an elegant solution of cases of non-constituent conjunction, not addressed in the other work. Our system also has a finer-grained treatment of optionality; whereas [71] and [33] divide argument roles into OBLIGATORY, ESSENTIAL, and NON-ESSENTIAL roles, we provide a richer set of possibilities. The decoupling of syntactic realizations of a role in a mapping unit from the semantic typing constraints on that role, to allow for different types of metonymic extension, is also a distinction. On the other hand, the use of named thematic or semantic roles which cut across particular predicates in [71] and [33] might provide a more compact form for capturing linguistic generalizations, to the extent that such a theory of thematic relations is well-motivated.


3.5  Generalized  Grammatical  Relations 


The  use  of  labelled  arguments  in  the  grammar,  introduced  above  in  Section  2.4,  is  a 
development  of  and  generalization  of  our  use  of  mapping  units  for  subcategorization.  Note 
that,  in  a  rule  such  as: 


(VP :MAP :BINDINGS2)

→

(VP :MAP :BINDINGS1)
{AVAILABLE DIRECT-OBJECT :MAP :BINDINGS1}
(NP :TRANS)
{VP-BIND DIRECT-OBJECT :MAP {CONSTIT (NP) (1)} :BINDINGS1 :BINDINGS2}


the NP daughter of VP is clearly meant to be interpreted as the DIRECT-OBJECT. However, that fact is not directly stated in the rule but is rather implied by the fact that this rule is only applicable if there is an available DIRECT-OBJECT and that building this rule binds this previously available DIRECT-OBJECT relation. Introducing relational labels directly into grammar rules (1) makes the grammatical relations of all the constituents of the rules obvious; (2) makes the former “available” and “(vp-)bind” relations implicit in the grammatical label (i.e. we cannot bind a grammatical relation unless it is available to be bound), a more natural state of affairs; and (3) allows us to develop a system of semantic interpretation which applies generally across various grammatical relations. We expand on these points in the rest of this section.

^Other systems which have a similar mechanism include the Lilog system of IBM Germany and the MiMo machine translation system.

Recall that grammatical relations are incorporated into the grammar by giving each element of the right hand side of a grammar rule a grammatical relation as a label. Some typical rules are, in schematic form:


(NP ...)

→

:HEAD (NP ...)
:PP-COMP (PP :PREP ...)


(N-BAR ...)

→

:PRE-NOM (N ...)
:HEAD (N-BAR ...)


The first indicates that an NP may have an NP as a head and a PP as an adjunct,
with the grammatical relation :PP-COMP holding between them (the actual operation of
binding a :PP-COMP splits it into a number of sub-relations based on the preposition, in
a manner already seen in the discussion of constraint relations in Section 3.2.1, so
this fact can be safely ignored here). The second indicates that the head need not occur
as the first constituent on the right side. All that is required is that one of the right-hand
elements is labeled as the “head” of the rule; it is the source of information about
the initial semantic and syntactic “binding state”, the equivalent of the BINDING variable
in our earlier mapping units scheme. This binding state controls whether or not the other
elements of the right-hand side can “bind” to the head via the relation that labels them.
Semantics is associated with grammatical relations, not with particular grammar rules (as
it is in Montague grammar and most unification-based semantics systems, including earlier
versions of DELPHI’s semantics).
3.5.1 “Binding Rules”: the Semantics of Grammatical Relations


The implementation of the use of such binding states in the transduction of grammatical
relations to semantic structure is facilitated by the compiled constraint relations we have
introduced into our unification grammar formalism; see Section 2.2.1. A separate system
of “binding rules” for each grammatical relation licenses the binding of a constituent to
a head via that relation by specifying the semantic implications of binding. These rules
generally specify aspects that must be true of the semantic structure of the head and bound
constituent in order for the binding to take place, and may also specify certain syntactic
requirements. They may take into account the existence of previous bindings to the head,
allowing certain semantic roles (such as time specification) to be filled multiply, while
other semantic roles may be restricted to having just one filler. Thus, given the use of
generalized grammatical relations throughout the grammar, the rule introducing a direct
object, which was presented above as:


(VP :MAP :BINDINGS2)
    →
    (VP :MAP :BINDINGS1)
    {AVAILABLE DIRECT-OBJECT :MAP :BINDINGS1}
    (NP :TRANS)
    {VP-BIND DIRECT-OBJECT :MAP {CONSTIT (NP) (1)} :BINDINGS1 :BINDINGS2}

now  has  the  much  simpler  form: 

(VP ...)
    →
    :HEAD (VP ...)
    :DIRECT-OBJECT (NP ...)


In effect, we may consider the relational label :DIRECT-OBJECT to be a
macro, which performs the functions of the previous constraint relations AVAILABLE,
which made sure that the map of the head V contained an unbound Direct Object, and
VP-BIND, which bound the direct object NP found to the Direct Object position in the
head V’s map, performing any necessary semantic type checking.

Note also that the semantic variables :MAP, :BINDINGS1, :BINDINGS2, and :TRANS
do not appear in the second version of the rule. This is because our current implementation
of the syntax-semantics interface no longer includes semantic variables as part of the term
structure of syntactic categories. Rather, syntactic rules and semantic information are
bundled together in a record data structure (the Common LISP defstruct). Constraint relations
that perform semantic operations, such as checking for the availability of a grammatical
relation to bind, binding semantic roles, etc., are able to access this semantic information
and perform any necessary operations (such as semantic type consistency checking). However,
semantic interpretation in the current version of DELPHI makes no real use of the
standard unification operation used to build syntactic structures.

As constituents are added to the recursive structure the binding list is extended. As
layers are added to the “onion”, a simple linear list of bindings is maintained representing
the head and its grammatical relation to each of the constituents added with each layer
of the onion. Semantic binding rules are used to verify the local semantic plausibility
of a structure, i.e. the semantic plausibility of each proposed grammatical relation. The
next phase of semantic interpretation takes place when the onion is complete, i.e. when
a constituent X is inserted as other than the head of a larger constituent. This situation
provides evidence that the outermost layer of the onion has been reached, and that no more
adjuncts are to be added. At this time it is possible to evaluate semantic rules that check
for completeness and produce an “interpretation” of the constituent. These completion
rules operate directly on the binding list, not on the recursive left or right branching tree
structure produced by direct application of the grammar. The actual tree structure is at
this level immaterial, having been replaced by the flattened binding list representation of
relational structure.
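
The interplay of binding rules, the flat binding list, and completion rules described above can be illustrated with a small Python sketch. It is not DELPHI's implementation (which is in Common LISP); the names `BINDING_RULES`, `Constituent`, `bind` and `complete`, and the semantic classes used, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical binding rules: relation -> (semantic classes the filler may
# have, whether the relation may be filled more than once).
BINDING_RULES = {
    ":DIRECT-OBJECT": ({"FLIGHT", "FARE"}, False),
    ":TIME-SPEC": ({"TIME"}, True),
}


@dataclass
class Constituent:
    category: str
    sem_class: str
    # Flat binding list: one (relation, constituent) pair per onion layer.
    bindings: list = field(default_factory=list)


def bind(head, relation, filler):
    """Binding rule check: extend the head's binding list if licensed."""
    allowed, multiple = BINDING_RULES[relation]
    if filler.sem_class not in allowed:
        return False  # semantic type check fails
    if not multiple and any(r == relation for r, _ in head.bindings):
        return False  # relation already bound and admits only one filler
    head.bindings.append((relation, filler))
    return True


def complete(head, required):
    """Completion rule: inspects only the binding list, never the tree."""
    bound = {r for r, _ in head.bindings}
    if not required <= bound:
        return None  # a required relation is still unbound
    return (head.category, [(r, f.category) for r, f in head.bindings])


vp = Constituent("VP", "SHOW-EVENT")
ok_object = bind(vp, ":DIRECT-OBJECT", Constituent("NP", "FLIGHT"))
second_object = bind(vp, ":DIRECT-OBJECT", Constituent("NP", "FARE"))
ok_time = bind(vp, ":TIME-SPEC", Constituent("PP", "TIME"))
interpretation = complete(vp, {":DIRECT-OBJECT"})
```

In this sketch the second direct object is rejected because the relation is already bound, while a time specification could be added repeatedly; the completion step never looks at how the layers were nested.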


3.5.2 Robustness Based on Statistics and Semantics


Simply having a transduction system with semantics based on grammatical relations does
not deal with the issue of robustness: the ability to make sense of an input even if it cannot
be assigned a well-formed syntactic tree. The difficulty with standard syntactic techniques
is that local syntactic evidence is not enough to accurately determine grammatical relations.
An NP (e.g. “John”) followed by a verb (e.g. “flew”) may be the subject of that verb
(e.g. “John flew to Boston”) or may be unrelated (e.g. “The man I introduced to John
flew to Boston”). The standard way of getting around this is to attempt to find a globally
consistent set of grammatical relation labels (i.e. a global parse) and make use of the
fact that the existence of a global parse containing a given relation is stronger evidence
for that relation than local structure (although syntactic ambiguity makes even such global
structures suspect). This is indeed the best approach if all you have available is a syntactic
grammar.

The strategy we have developed in DELPHI is based on the existence of two other
sources of information. In the first place we have semantic constraints that can be applied
incrementally, so that we can check each proposed grammatical relation for semantic
coherence in the context of other assumed grammatical structures. Additionally, we have
statistical information on the likelihood of various word senses, grammatical rules, and
grammatical-semantic transductions, as discussed in Section 4.2, below. Thus we can not
only rule out many locally possible grammatical relations on the basis of semantic
incoherence, we can rank alternative local structures on the basis of empirically measured statistics.
The net result is that even in the absence of a single global parse, we can be reasonably
sure of the local grammatical relations and semantic content of various fragments. We can
even give numerical estimates of the likelihood of each such structure.
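
As a rough illustration of this two-step strategy (filter by semantic coherence, then rank by likelihood), consider the following Python sketch. Everything in it is an assumption made for the example: the function `rank_relations`, the table `ALLOWED`, the relation names, and above all the likelihood values, which are invented placeholders rather than measured statistics.

```python
# Hypothetical table of which semantic classes may fill which relation.
ALLOWED = {
    ":SUBJECT": {"PERSON", "AIRLINE"},
    ":DIRECT-OBJECT": {"AIRCRAFT"},
    ":INDIRECT-OBJECT": {"PERSON"},
}


def is_coherent(candidate):
    """Semantic check applied to one proposed grammatical relation."""
    return candidate["filler_class"] in ALLOWED[candidate["relation"]]


def rank_relations(candidates, coherent):
    """Drop incoherent relations, then order the rest by likelihood."""
    kept = [c for c in candidates if coherent(c)]
    return sorted(kept, key=lambda c: c["likelihood"], reverse=True)


# Made-up local hypotheses for the NP "John" adjacent to the verb "flew".
candidates = [
    {"relation": ":DIRECT-OBJECT", "filler_class": "PERSON", "likelihood": 0.3},
    {"relation": ":INDIRECT-OBJECT", "filler_class": "PERSON", "likelihood": 0.1},
    {"relation": ":SUBJECT", "filler_class": "PERSON", "likelihood": 0.6},
]

ranked = rank_relations(candidates, is_coherent)
best = ranked[0]["relation"]
```

Here the direct-object hypothesis is removed on semantic grounds even though its placeholder likelihood is higher than that of one surviving alternative.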


3.5.3 Advantages of This Approach

The separation of syntactic grammar rules from semantic binding and completion rules has
important consequences for processing. First, it enables the notion of grammatical relation
to be separated from the notion of tree structure, and thus greatly facilitates fragment
parsing; see Section 4.3. Second, while it allows syntax and semantics to be strongly
coupled in terms of processing (parsing and semantic interpretation) it allows them to be
essentially decoupled in terms of notation. This makes the grammar and the semantics
considerably easier to modify and maintain.

We believe, however, that in the long term the most important advantage is that this
view leads us to a new kind of language model, in which knowledge can be much more
easily extracted through automatic training. We view the role of the grammar as codifying
the way that tree structure provides evidence for grammatical relations. Thus the rule


(NP ...)
    →
    :HEAD (NP ...)
    :PP-COMP (PP :PREP ...)


says that a noun phrase followed by a prepositional phrase provides evidence for the relation
:PP-COMP between the PP and NP head.

The separation between rule types will allow us for the first time to consider the effect
of grammatical relations on meaning, independently of the way that evidence for these
relations is produced by the parser. One effect of this is to make it possible to use a
hypothesized semantic interpretation of a set of tree fragments to generate a new syntactic
rule.

Thus, in normal operation, the primary evidence for a grammatical relation is the
result of actually parsing part of an input. However, since grammatical relations between
constituents entail semantic relations, if we can make an estimate of the likelihood of certain
semantic relations based on domain knowledge, pragmatics, task models, etc., it is in
principle possible to use abductive reasoning to suggest likely grammatical relations, and
thereby propose new grammar rules. In effect, grammatical relations form an abstract level
of representation that greatly simplifies the interaction of syntactic and semantic processing.


3.6  Unification  Semantic  Interpretation  in  DELPHI 


In this section, we detail various problems solved in the earlier, unification semantics,
version of DELPHI. We believe that some of the mechanisms we implemented may be of use
to other researchers. Moreover, all of the semantic issues raised here must be faced in any
treatment of semantics.

We first introduced semantic interpretation into grammatical analysis in DELPHI by
adding semantic features directly into the grammar rules, as is common in complex feature
based grammar formalisms. We describe here our initial efforts along these lines, some
of which pre-dated the restructuring of the grammar described in Section 2.3. The facility
for case analysis with constraint relations described in Section 2.2 is also used in semantic
interpretation, where the meaning representations of constructions are computed in terms of
values of certain features which cannot always be known in advance.


3.6.1  Analysis  of  Noun  Phrases  and  Noun  Modifiers 

Figure  3.7  presents  a  simplified  form  of  the  rule  for  regular  count  Noun  Phrases  used  in 
the  version  of  our  grammar  to  which  semantic  features  were  first  added.  Semantic  features 
are  underlined. 


(NP :NSUBCATFRAME (AGR :P :N) :WH (Q-TERM :Q :VAR :NOM5))
    →
    (DETERMINER :V :WH :NOM1 :NOM2 :Q)
    (OPTNONPOSADJP (AGR :P :N) :NOM4 :NOM5)
    (OPTADJP (AGR :P :N) (PRENOMADJ) :NOM3 :NOM4)
    (N-BAR :NSUBCATFRAME (AGR :P :N) :NOM1)
    (OPTNPADJUNCT (AGR :P :N) :NOM2 :NOM3)


Figure 3.7: Unification Semantics for NP


This rule generates NPs that have at least a determiner and a head noun, and which
have zero or more prenominal superlative or comparative adjectives (“cheapest”, “less
expensive”, etc.), prenominal positive adjectives (“overnight”, “alleged”), and adjuncts (“in
Boston”, “that arrive from Denver”). Its effect is to take :NOM1, the NOM-SEM semantics
of the head noun passed up through N-BAR, and thread it through the various modifications,
add a quantifier and a variable for quantification, and deliver the resulting package as the
semantics for the whole NP.

NOM-SEM terms have the following structure. The principal functor of this type is
NOM, which has the argument structure:


(NOM PARAM-LIST SET-EXP SORT)


The PARAM-LIST is a (possibly empty) list of parameters, used to indicate the free
argument places in a relational noun. SET-EXP is a logical expression which denotes a
set of individuals. SORT is a term structure which represents the semantic class of the
elements of SET-EXP.

The initial NOM-SEM comes from the head N-BAR, and is indicated by :NOM1, the
variable in that argument position. It is first of all passed to the DETERMINER. Along
with a quantifier, :Q, the DETERMINER passes back a possibly modified NOM-SEM,
:NOM2. The reason for this is that the determiner may be possessive, and a possessive
determiner effectively functions as a noun modifier which enters into scope relations with
other modifiers of the NP. Consider the noun phrase “John’s best book”. This cannot be
analyzed as


(SET X (BEST' BOOK') (EQUAL (AUTHOR-OF X) JOHN'))


that is, as the subset of the best books in the world that also happen to be written by John.
Instead, it must be analyzed as:

(BEST' (SET X BOOK' (EQUAL (AUTHOR-OF X) JOHN')))


that is, the best of the books written by John.

The essential point is that the possessive DETERMINER must carry out its modification
before other elements of the NP can, yet must still follow all other modifications in affixing
a quantifier to the final result of the NP. If the determiner is conceived of as just a
higher-order function returning a single value, as in Montague Grammar, it is difficult to see how
this can be done. The virtue of this approach is that it allows the determiner to return as
separate values both a quantifier and a suitably modified nominal.

If the determiner is not possessive it simply passes up the same NOM-SEM it was
originally given. The NOM-SEM returned by the DETERMINER, whether modified or
not, is then passed down to the adjuncts of the NP as :NOM2, which modify it and
return :NOM3. This is then passed to the regular (non-superlative) prenominal adjectives
for further modification, returning :NOM4. Finally, :NOM4 is passed to the constituent
OPTNONPOSADJP, the optional superlative adjectives. The final NOM-SEM, :NOM5, is
passed up to become an element of the complete Q-TERM semantics of the NP.
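
The order of threading just described can be made concrete with a short Python sketch. It only mimics the data flow (:NOM1 through :NOM5) using plain tuples; it is not the unification mechanism DELPHI used. The helper names, the choice of THE as the quantifier returned by a possessive determiner, and the sort label BOOK are assumptions made for the example.

```python
def nom(params, set_exp, sort):
    """Build a (NOM PARAM-LIST SET-EXP SORT) term as a tuple."""
    return ("NOM", params, set_exp, sort)


def possessive_determiner(owner, relation):
    """Determiner that returns a quantifier AND a modified nominal."""
    def det(n):
        _, params, set_exp, sort = n
        restricted = ("SET", "X", set_exp, ("EQUAL", (relation, "X"), owner))
        return "THE", nom(params, restricted, sort)  # THE is an assumption
    return det


def plain_determiner(quantifier):
    """Non-possessive determiner: passes the nominal back unchanged."""
    return lambda n: (quantifier, n)


def superlative(functor):
    """Superlative adjective: wraps the whole set expression."""
    def modify(n):
        _, params, set_exp, sort = n
        return nom(params, (functor, set_exp), sort)
    return modify


def np_semantics(nom1, determiner, adjuncts=(), adjectives=(), superlatives=()):
    quantifier, current = determiner(nom1)        # :NOM1 -> :Q and :NOM2
    for stage in (adjuncts, adjectives, superlatives):
        for modify in stage:                      # :NOM3, :NOM4, :NOM5
            current = modify(current)
    return ("Q-TERM", quantifier, ":VAR", current)


book = nom((), "BOOK'", "BOOK")
johns_best_book = np_semantics(
    book,
    possessive_determiner("JOHN'", "AUTHOR-OF"),
    superlatives=[superlative("BEST'")],
)
```

Because the determiner's modification is applied first and the superlative last, the resulting set expression has BEST' outermost, as in the second analysis of “John’s best book” above.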

Q-TERMs  have  the  following  form: 


(Q-TERM QUANTIFIER VAR NOM-SEM)


The QUANTIFIER is one of the many quantifiers that correspond to determiners in
English:  ALL,  SOME,  THE  and  various  WH  determiners.  Proper  NPs  were  treated  as 
definite  descriptions  in  this  version  of  our  system;  they  were  thus  represented  using  the 
THE  quantifier. 

The  VAR  denotes  a  variable  of  the  object  language,  and  is  left  uninstantiated  (being 
filled  in  by  a  unique  object-language  variable  by  a  quantifier  module).  The  NOM-SEM 
represents the set that the quantification ranges over; it effectively represents the semantics
of  the  head  of  the  NP  after  modification  by  the  NP’s  other  constituents. 

Note  that  the  structure  of  Q-TERMs  and  NOM-SEMs  means  that  the  SORT  field  of  the 
NOM  is  accessible,  via  one  level  of  indirection,  from  the  Q-TERM  NP  representation.  It 
is  this  feature  which  provides  the  means  for  selectional  restriction  based  on  semantic  class 
elsewhere  in  the  grammar. 

Semantic classes (arranged in a hierarchy) are represented as complex terms, whose
arguments may themselves be complex terms. A translation (described in Section 3.6.2) is
established between semantic classes and these terms such that non-empty overlap between
two classes corresponds to unifiability of the corresponding terms, and disjointness between
classes corresponds to non-unifiability of the corresponding terms.

As an example of the action of the modifying elements in the above rule, consider the
following rule for generating an NP adjunct from a PP:

(OPTNPADJUNCT :NOM1 :NOM2)
    →
    (PP :PP)
    {MODIFYING-PP :PP :NOM1 :NOM2}


The NOM-SEM passed in from the containing NP, :NOM1, is in turn passed down
to a constraint relation, MODIFYING-PP, which takes the semantics of the PP, :PP, and
“computes” the modified NOM-SEM, :NOM2, which is then passed back to the NP as the
result of the modification. Note that PPs in our system are given only partial semantic
interpretations, which consist of the preposition of the PP and the translation of the PP’s
NP object. Their representations are thus of the following form:


(PP-SEM PREP NP-SEMANTICS)


MODIFYING-PP  is  used  to  encompass  different  kinds  of  PP  modification.  Relational 
modification,  where  the  PP  essentially  fills  in  an  argument,  is  handled  by  the  following 
solution to MODIFYING-PP:


{MODIFYING-PP (PP-SEM (OFPREP) :NP)
              (NOM (PARAM :NP) :SET :SORT)
              (NOM (NO-PARAM) :SET :SORT)}
    → ∅


Since this rule is a constraint relation solution, its right-hand side is empty. It unifies the
NP object of the PP with the “parameter NP” of the argument nominal. Of course, it
will not be unifiable if the argument nominal does not contain a parameter NP, or if the
parameter NP of the argument nominal contains the wrong semantic type.

The  lexical  rule  introducing  the  relational  noun  “cost”  in  its  application  to  ground 
transportation is as follows:


(N (NOM (PARAM #1=(Q-TERM :Q :VAR
                     (NOM :PARS :SET (GROUND-TRANSPORTATION-SUMMARY))))
        (SETOF (COST' #1#))
        (INANIMATE (DOLLAR-AMT))))
    →
    (cost)
Note the requirement that the filler of the slot be of sort GROUND-TRANSPORTATION-SUMMARY,
and the co-occurrence of this filler inside the NOM’s set expression.


Of course PPs can also occur in a predicative sense. For example, an airport can be
“in” a city. To handle this we have the following solution to the constraint relation
PREDICATIVE-PP:


{PREDICATIVE-PP
    (PP-SEM (INPREP)
            #1=(Q-TERM :Q1 :VAR1
                 (NOM :PARS1 :SET1 (INANIMATE (CITY)))))
    #2=(Q-TERM :Q2 :VAR2
         (NOM :PARS2 :SET2 (INANIMATE (AIRPORT))))
    (EQUAL (CITY-OF #2#) #1#)}
    → ∅


Note that this constraint solution will only unify if the class of the NP object of the PP
unifies  with  CITY,  and  the  class  of  the  NP  being  predicated  of  unifies  with  AIRPORT. 

When  such  a  PP  occurs  as  an  adjunct  to  an  NP,  the  derivation  passes  through  the 
following  indirect  “lifting”  rule: 


{MODIFYING-PP :PP
              (NOM :PAR :SET :SORT)
              (NOM :PAR (SET :VAR :SET :WFF) :SORT)}
    →
    {PREDICATIVE-PP :PP
                    (Q-TERM (BOUND-Q) :VAR
                            (NOM :PAR :SET :SORT))
                    :WFF}


Although the right-hand side of the rule is in this case not empty, it will like all constraint
relations  derive  the  empty  string  in  the  end. 

Similar distinctions of modificational power are seen in the case of adjectives, where
an adjective like “average” or “previous” has the power to abstract over free parameters of
the noun meaning, while an adjective like “female” does not. Consider the rule below:

(OPTADJP (AGR :P :N) :POSIT :NOM1 :NOM3)
    →
    (ADJP (AGR :P :N) :POSIT :ADJ-SEM)
    (OPTADJP (AGR :P :N) :POSIT :NOM1 :NOM2)
    {MODIFYING-ADJ-READING :ADJ-SEM :NOM2 :NOM3}

This rule generates a string of one or more adjectives. Nominal semantics is threaded
through the adjectives right to left. Adjective phrase semantic representations (ADJ-SEMs)
come in two varieties:


(MODIFYING-ADJ NOM-SEM NOM-SEM)

and


(PREDICATIVE-ADJ NP-SEMANTICS WFF)


These represent different semantic types of adjective. Adjectives like “previous”, with
the power to modify the whole noun, have a semantic representation headed by the
functor MODIFYING-ADJ, while adjectives like “female”, which only operate upon
individual elements of the noun’s extension, have a representation headed by the functor
PREDICATIVE-ADJ. The constraint relation MODIFYING-ADJ-READING accepts the
first kind of adjective unchanged and lifts the second kind to the appropriate level. Note
that while predicative PPs and adjectives can be “lifted” to the noun modifying level, the
converse is not true. That is, the system does not allow “That value is previous” or “That
salary is of Clark”.


3.6.2  Other  Aspects  of  DELPHI  Unification  Semantics 

Encoding  Semantic  Classes  as  Terms 


The translation from semantic classes to complex terms can be performed systematically.
In this section we present an algorithm for translating semantic classes to terms, designed
to work on taxonomies of semantic classes represented in a system such as KL-ONE [82]
or NIKL [67]. It has the advantage, important from the point of view of such systems, that
it correctly handles the distinction between “primitive” and “defined” classes, where “defined”
means that the class is simply an intersection of two or more other classes.


BBN  Systems  and  Technologies 


BBN Report No. 7715




The algorithm is seen in Figure 3.8, where the main work is done by the function
TRANSLATE.

Throughout, the symbol :ANY indicates a “don’t care” variable, unifying with anything.
This is in fact the only use of variables made. The operation REGULARIZE is used to
remove non-primitive classes from the taxonomy, and set them aside. It is simple and we
do not give it here.

We now consider the classes PERSON, MALE, FEMALE, ADULT, CHILD, MAN and
PRIEST. MALE and FEMALE are disjoint sub-classes of PERSON, as are ADULT and
CHILD. MAN is the class which is the intersection of ADULT and MALE. PRIEST is a
sub-class of MAN, but not identical to it. Following are the translations the algorithm in
Figure 3.8 gives to several of these classes:


PERSON → (PERSON :ANY :ANY)
ADULT  → (PERSON (ADULT :ANY) :ANY)
MALE   → (PERSON :ANY (MALE :ANY))
MAN    → (PERSON (ADULT :ANY) (MALE :ANY))
PRIEST → (PERSON (ADULT (PRIEST)) (MALE (PRIEST)))


Essentially,  the  algorithm  works  by  mapping  each  set  of  mutually  disjoint  children  of  the 
class  to  an  argument  place  of  the  term  to  be  associated  with  that  class.  The  term  associated 
with  a  class  has  the  same  depth  as  the  depth  of  the  class  in  the  taxonomy. 
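
The following Python sketch checks that the translations listed above behave as claimed under unification. It does not implement TRANSLATE or REGULARIZE; the terms are copied by hand from the list above, and the term for FEMALE (not given in the text) is an assumption formed by analogy with MALE. Since :ANY is the only variable and is anonymous, a unifier without variable bindings suffices here.

```python
ANY = ":ANY"  # the "don't care" variable


def unify(a, b):
    """Unify two terms written as nested tuples; return None on failure."""
    if a == ANY:
        return b
    if b == ANY:
        return a
    if isinstance(a, tuple) and isinstance(b, tuple):
        if len(a) != len(b) or a[0] != b[0]:
            return None  # different functor or arity
        args = []
        for x, y in zip(a[1:], b[1:]):
            u = unify(x, y)
            if u is None:
                return None
            args.append(u)
        return (a[0],) + tuple(args)
    return a if a == b else None


PERSON = ("PERSON", ANY, ANY)
ADULT = ("PERSON", ("ADULT", ANY), ANY)
MALE = ("PERSON", ANY, ("MALE", ANY))
MAN = ("PERSON", ("ADULT", ANY), ("MALE", ANY))
PRIEST = ("PERSON", ("ADULT", ("PRIEST",)), ("MALE", ("PRIEST",)))
# Assumed by analogy with MALE; this term does not appear in the text.
FEMALE = ("PERSON", ANY, ("FEMALE", ANY))

adult_and_male = unify(ADULT, MALE)
man_and_priest = unify(MAN, PRIEST)
male_and_female = unify(MALE, FEMALE)
```

Unifying the terms for ADULT and MALE yields exactly the term for MAN, which reflects MAN being a defined class, while the terms for MALE and FEMALE fail to unify because the classes are disjoint.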

The translations produced by this algorithm are similar to those produced by the algorithm
by Mellish [63]. We claim two advantages for ours. First, and as already pointed out, it
takes into account the difference between “if” (primitive) and “if-and-only-if” (non-primitive)
axiomatizations, where it would seem that the Mellish algorithm does not. Second, it is
simpler, not requiring such notions as “paths” and extensions “to” and “beyond” them.

As a final comment on the issue of encoding semantic classes as terms, we note that
there is another encoding method which may have been overlooked: that is, encoding each
class as a term which has the same number of arguments as there are classes. It works as
follows. In the argument position corresponding to the class being translated put a “1”, and
put a “1” in argument positions corresponding to subsuming classes as well. In argument
positions corresponding to disjoint classes put a “0”. In all other positions put a “don’t-care”
variable. While perhaps using space inefficiently, this encoding will have all the desired
properties.
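
This alternative encoding is simple enough to sketch directly. In the Python below, `None` plays the role of the “don’t-care” variable, and the taxonomy tables `CLASSES`, `PARENTS` and `DISJOINT` are a hand-written rendering of the PERSON example. One assumption goes beyond the text: a class is taken to be disjoint from anything declared disjoint with one of its subsuming classes.

```python
CLASSES = ["PERSON", "ADULT", "CHILD", "MALE", "FEMALE", "MAN", "PRIEST"]
PARENTS = {
    "PERSON": [], "ADULT": ["PERSON"], "CHILD": ["PERSON"],
    "MALE": ["PERSON"], "FEMALE": ["PERSON"],
    "MAN": ["ADULT", "MALE"], "PRIEST": ["MAN"],
}
DISJOINT = [("ADULT", "CHILD"), ("MALE", "FEMALE")]


def subsumers(cls):
    """The class itself plus every class that subsumes it."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS[c])
    return seen


def encode(cls):
    """One argument per class: 1, 0, or None for don't-care."""
    ups = subsumers(cls)
    zeros = set()
    for a, b in DISJOINT:  # disjointness is inherited from subsumers
        if a in ups:
            zeros.add(b)
        if b in ups:
            zeros.add(a)
    return tuple(1 if c in ups else 0 if c in zeros else None for c in CLASSES)


def unify(t1, t2):
    """Position-wise unification; None means the classes cannot overlap."""
    out = []
    for x, y in zip(t1, t2):
        if x is None:
            out.append(y)
        elif y is None or x == y:
            out.append(x)
        else:
            return None
    return tuple(out)


man_term = encode("MAN")
overlap = unify(encode("ADULT"), encode("MALE"))
clash = unify(encode("PRIEST"), encode("FEMALE"))
```

Every term has as many argument places as there are classes, which is the space cost noted above; in exchange, unification reduces to a position-by-position comparison.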




Using  Unification  to  Encode  Semantic  Constraints  beyond  Semantic  Type 


Not all constraints on meaningfulness are strictly reflections of the semantic type of phrase
denotations. Consider the lexical item “L” in the ATIS domain. Seen in a number of
different database fields, it can variously denote limousine availability, lunch service, or
other classes of service available on a flight. Yet in the following example it is clear that
its usage is relevant to just the first of these:


What  is  transport  code  L? 


Our claim is that it is not just the referent of “L” (limousine service for ground
transportation) that plays a role here, but also the means by which it gets to that
referent: namely, by being an abbreviation or code rather than a name. That is, “transport
code L” is a meaningful compound, while “transport code limousine” would not be. The
lexical entry for abbreviation terms like “L” reflects this by taking the form of an inverse
function application: the referent of the lexical item “L” is the ground transportation type
that has the string “L” as its abbreviation:


(INVERSE-VAL* (ABBREV-OF) "L" (GROUND-TRANSPORTATION))


While this has the same referent as the lexical entry for “limousine” it has a different
form, one which the rule analyzing the construction above makes use of.

Nominal  compounds  in  the  DELPHI  system  are  generated  by  the  following  rule: 


(N-BAR :NOM3)
    →
    (N :NOM2)
    (N-BAR :NOM1)
    {NOM-COMP-READING :NOM1 :NOM2 :NOM3}


in which the constraint relation NOM-COMP-READING computes the semantics of the
whole construction from the semantics of the head noun (passed up to N-BAR via the
variable :NOM1) and of the nominal modifier (:NOM2). NOM-COMP-READING has
different solutions for different semantic types of noun translation. The relevant one here
is:

{NOM-COMP-READING
    (NOM :PARS1 (INVERSE-VAL* :FUNCTION :TERM1 :SET) :ARG-SORT)
    (NOM (CONS-P (REL1) (:ARG type :ARG-SORT) :PARS2)
         (REL-APPLY* :FUNCTION :NP)
         :VAL-SORT)
    (NOM :PARS1 (INVERSE-VAL* :FUNCTION :TERM1 :SET) :ARG-SORT)}


The first slot of the NOM terms above encodes the argument-taking property of relational
nouns such as “code”, “salary” or “speed”, and has been described above. The rule states
that an inverted attribute reference (here, “L”) preceded by a relational noun (here, “code”)
for that same attribute (here, “ABBREV-OF”) simply refers to that inverted attribute
reference.

Our view is that the preceding nominal modifier essentially performs a function of
disambiguation: it serves to distinguish the desired sense of the head from any other possible
one. This is reinforced when the whole compound, “transport code L”, is considered.
Another NOM-COMP-READING rule is responsible for combining “code” with
“transport”:


{NOM-COMP-READING
    (NOM :PARS1 :ARG-SET :ARG-SORT)
    (NOM (CONS-P (REL1) (:ARG type :ARG-SORT) :PARS2)
         (REL-APPLY* :FUNCTION :NP)
         :VAL-SORT)
    (NOM (CONS-P (REL1) :ARG :PARS2)
         (REL-APPLY* (F-RESTRICT :FUNCTION :ARG-SET) :NP)
         :VAL-SORT)}


This constrains the domain of the relational noun it modifies to be just the set that is the
translation of the modifying noun. The semantic type of the modifying noun must be
unifiable with the argument type of the head relational noun. After this unification forms
the translation of the compound “transport code”, the resulting type constraints serve to
distinguish the correct sense of “L” (that of ground transportation) from the meal service
and other senses.

Our technique of allowing rules to impose restrictions on the forms of semantic
translations themselves, rather than merely on the semantic types of these translations, bears
some discussion because it differs from proposals made by others, such as [66]. In that
work, the position is taken that inspecting or restricting the structure of a logical form is
inadmissible on theoretical grounds, in that it violates or makes unenforceable the
principle of compositionality. As far as theoretical matters go, we believe that the principle of
compositionality is open to many interpretations (for example, see [72]) and that its most
general interpretation does not exclude techniques such as ours.

A more practical concern, and one which may well underlie or justify the theoretical
qualm, is that rules based on the structure of logical form may not succeed if that
structure happens to be transformed (say, through wrapping with another structure) into some
syntactically different form which cannot be recognized as one which the rule should allow.

This is not a problem for the rules presented in this section, since the operation of
nominal compounding is so tightly localized, operating in a function-argument fashion
that gets to the noun meaning “first”, before other modifications such as postnominal
adjuncts. Ultimately, though, we feel that the real solution must lie in a different approach to
meaning representation, one in which non-denotational properties of the utterance meaning
are encoded or highlighted in a way that stands above the variances of syntactic logical
form.


3.6.3 Impact of Our First Implementation of Constraint Relations on Parsing

Constraint relations are generally useful in that they allow one to give a name to a particular
condition and use it in multiple places throughout the grammar. Consider verbs which take
PPs and ADJPs as complements. In “John became happy”, it is intended that the
adjective “happy” apply to the subject “John”. It would not make sense to say “The table
became happy”. Similarly, in “I put the book on the floor”, the PP “on the floor” is intended
to apply to the object NP “the book” and it would not make sense to say “I put the idea on
the floor”. Semantic type constraints in such cases clearly hold not just between the verb
and its various arguments, but between the arguments themselves. A constraint relation like
PREDICATIVE-PP can be used to express this relationship between arguments when it is
needed.

The core of the DELPHI parser is a bottom-up left-to-right algorithm, discussed in
more detail in Chapter 4. Formally speaking, this algorithm can parse the kind of grammar
we have been discussing without any modification, since the constraint relations and their
solutions can simply be incorporated into the algorithm’s empty symbols table.

For a non-toy domain, however, this increases the size of the parse tables intolerably.
For our first implementation of constraint relations for semantic interpretation, therefore,
we modified the algorithm so that it treated empty constraint relation symbols specially,
not expanding them when the parse tables were built but instead waiting until parse time,
where it solved them top-down through a process that might be thought of as a kind of
all-paths non-backtracking Prolog.

A problem still arose when constraint relations received traces as arguments. Until a
trace is bound, it of course contains very little information, and hence unifies with almost
any constraint relation solution. Since bottom-up parsing often hypothesizes traces, there
is a consequent combinatorial explosion which can lead to slow parsing.

The obvious solution to this problem was simply to defer the attempt to solve constraint
relations until the point in the parse where they have received adequate instantiation. The
definition of “adequate” clearly differs from constraint relation to constraint relation: in
the case of PREDICATIVE-PP it might be that the preposition and class of NP object be
known. Until constraint relations are sufficiently instantiated to bother solving, they can
simply be carried as extra riders on chart edges, being passed up as new edges are built.
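
The deferral strategy just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DELPHI implementation: the names `Constraint`, `Edge`, `combine`, and the small `PREDICATIVE_PP_SOLUTIONS` table are hypothetical, and "adequate instantiation" is modelled simply as "all required arguments are bound".

```python
from dataclasses import dataclass, field

# Hypothetical solution table for a PREDICATIVE-PP-like relation:
# (preposition, semantic class of the NP the PP applies to).
PREDICATIVE_PP_SOLUTIONS = {("on", "physical-object"), ("in", "physical-object")}


@dataclass
class Constraint:
    name: str
    args: dict        # argument name -> value, or None while still unbound
    required: tuple   # arguments that must be bound before solving is worthwhile

    def ready(self):
        return all(self.args.get(a) is not None for a in self.required)


@dataclass
class Edge:
    category: str
    pending: list = field(default_factory=list)  # deferred constraints ("riders")


def solve(constraint):
    """Check a fully instantiated constraint against the solution table."""
    key = (constraint.args["prep"], constraint.args["np_class"])
    return key in PREDICATIVE_PP_SOLUTIONS


def combine(category, children, bindings):
    """Build a parent edge, solving riders that have become ready.

    Returns None if a ready constraint has no solution; otherwise the new
    edge carries every constraint that is still under-instantiated.
    """
    pending = []
    for child in children:
        for c in child.pending:
            args = {k: (v if v is not None else bindings.get(k))
                    for k, v in c.args.items()}
            updated = Constraint(c.name, args, c.required)
            if not updated.ready():
                pending.append(updated)      # keep carrying it upward
            elif not solve(updated):
                return None                  # edge is ruled out
    return Edge(category, pending)


pp = Edge("PP", [Constraint("PREDICATIVE-PP",
                            {"prep": "on", "np_class": None},
                            ("prep", "np_class"))])
vp = combine("VP", [pp], {})                               # still deferred
ok = combine("S", [vp], {"np_class": "physical-object"})   # e.g. "the book"
bad = combine("S", [vp], {"np_class": "abstract"})         # e.g. "the idea"
```

In this sketch the constraint rides on the PP and VP edges untouched and is only solved once the class of the NP is known, which avoids unifying an uninformative argument against every solution.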

This solution, too, was inadequate, since it required either manually marking the
constraint relations whose solutions needed to be deferred or delaying the solution of all
constraint relations. Our ultimate solution to the problem of using constraint relations was
two-fold:


•  Restructuring the grammar where possible so that PPs and similar modifiers and their
associated constraint relations were introduced at a point at which all the information
needed for a solution was available. Our use of recursive structures in NP (Section
2.3) is an example of one such restructuring.

•  Utilizing general interpretation principles that would only attempt to construct a
semantic formula for a given constituent at a point in the parse when all necessary
information was available. The mapping units and general labelled argument
mechanisms discussed in Sections 3.4 and 3.5 are examples of this.


TRANSLATE-TAXONOMY (top)
{
  CONJUNCTION-CLASSES := REGULARIZE (top)
  TRANSLATIONS := TRANSLATE (top)
  (for pairing in CONJUNCTION-CLASSES
     do tmp := :ANY
        (for class in pairing[2]
           do tmp := (UNIFY (TRANSLATIONS (class), tmp)))
        TRANSLATIONS (pairing[1]) := tmp)
  TRANSLATIONS
}

TRANSLATE (concept) :
{
  DISJOINTNESS-CLASSES :=
    (PICK-ARBITRARY-ORDER
      (SET s (POWER (CHILDREN concept))
        (AND (NON-EMPTY s)
             (FORALL x s (FORALL y s (-> (NOT (= x y))
                                         (DISJOINT x y)))))))

  (for class in DISJOINTNESS-CLASSES
     do (for sub-concept in class
           do (for trans in (TRANSLATE sub-concept)
                 do (TRANSLATIONS trans[1]) :=
                      (UNIFY (CONS concept
                                   (for class' in DISJOINTNESS-CLASSES
                                      collect (if (= class class')
                                                  trans[2]
                                                  :ANY)))
                             (TRANSLATIONS trans[1])))))

  TRANSLATIONS (concept) :=
    (CONS concept (for class in DISJOINTNESS-CLASSES collect :ANY))

  TRANSLATIONS
}


Figure 3.8: Translation Algorithm


Chapter  4 


Parsing 


In this chapter we examine the changes in the processing strategies for using grammatical
knowledge in DELPHI. We begin by examining the introduction of procedural elements
into our grammatical formalism, to increase efficiency. Next, we describe the use of
statistical training techniques to transform our parser from all-paths to best-first. Finally,
we detail the use of fragment-processing and frame-based fall-back techniques to handle
ill-formed and extra-grammatical utterances. We can look at these techniques as maximizing
the performance of the system along the following lines:


•  procedural elements increase the efficiency of the use of the grammar as a whole

•  statistical techniques optimize the use of the most common rules of the grammar

•  fallback strategies allow the system to handle extra-grammatical constructions, whether
they are due to limited coverage, ill-formedness, or recognition errors

4.1  Introducing  Procedural  Elements  into  Unification  Parsing 


Unification grammars based on complex feature structures are theoretically well-founded,
and their declarative nature facilitates exploration of various parsing strategies. However,
a straightforward implementation of such parsers can be painfully inefficient, exploding
lists of possibilities, and failing to take advantage of search control methods long utilized
in more procedurally-oriented parsers. In the context of BBN’s DELPHI system, we have
explored modifications that gain procedural efficiency without sacrificing the theoretical
advantages of complex feature-based grammars.

One class of changes was to introduce varieties of structure sharing or “folding” to
control combinatorics. One kind of sharing was achieved automatically by partially combining
similar grammar rules in the tables used by the parser. Another resulted from introducing
a strictly limited form of disjunction that grammar writers could use to reduce the number
of separate rules in the grammar.

The other class of changes introduced procedural elements into the parsing algorithm to
increase execution speed. The major change here was adding partial prediction based on a
procedurally-tractable and linguistically-motivated subset of grammar features. Appropriate
choice of the features on which to base the prediction allowed it to cut down substantially
the space that needed to be searched at runtime. During this first phase of our work, we
also explored the use of non-unification computation techniques for certain subtasks, where
the nature of the computation is such that approaches other than unification are significantly
faster but can still be integrated effectively into an overall unification framework.

Together, the classes of changes discussed here resulted in up to a 40-fold reduction in
the amount of structure created by the parser for certain sentences and an average 5-fold
parse time speedup in the BBN DELPHI system.


4.1.1 Folding of Similar Structures

Major improvements were obtained by partially combining similar structures, either
automatically, by combining common elements of rules, or manually, through the use of a
limited form of disjunction.


Automatic  Rule  Folding 


The goal here was to provide an automatic process that would take advantage of the large
degree of similarity frequently found between different rules in a unification grammar by
overlapping their storage or execution paths. While the modularity and declarative nature
of such a grammar are well-served by representing each of a family of related possibilities
with its own rule, storing and testing and instantiating each rule separately can be quite
expensive. If common rule segments could be automatically identified, they could be
partially merged, reducing both the storage space for them and the computational cost of
matching against them.

We implemented a scheme that combines rules with equivalent first elements on their
right hand sides into rule-groups. This kind of equivalence between two rules is tested
by taking the first element of the right hand side of each rule and seeing if they mutually
subsume each other, meaning that they have the same information content. If so, the
variables in the two rules are renamed so that the variables in the first elements on their
right hand sides are named the same, meaning that those elements in the two rules become
identical, although the rest of the right hand sides and the left hand sides may still be
different. This equivalence relation is used to divide the rules into equivalence classes with
identical first right hand side elements.

Before this scheme was adopted, the parser, working bottom up and having discovered a
constituent of a particular type covering a particular substring, would individually test each
rule whose right hand side began with a constituent of that category to see if this element
could be the first one in a larger constituent using that rule. For example, each of the dozen
VP rules the grammar contains that begin with a V would have to be separately matched. (In
the earlier version of the grammar, before the mapping unit approach to subcategorization
was adopted, this would have been on the order of 80 VP rules.) After adding rule-groups,
the parser only needs to match against the common first right side element for each group.
If that unification fails, none of the rules in the group are applicable; if it succeeds, the
resulting substitution list can be used to set up continuing configurations for each of the
rules in the group. Use of the rule-groups scheme collapsed over 270 rules out of the 453
of the full grammar into approximately 70 rule-groups, meaning that each single test of a
rule-group on the average did the work of testing between three and four individual rules.
Some rule groups were much larger, however, such as those whose leftmost constituent
was V (12), DETERMINER (12), or NP (19).
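
The grouping step can be illustrated with a short Python sketch. It is a simplification under stated assumptions rather than the DELPHI code: rules are hypothetical `(lhs, rhs)` pairs of nested tuples, variables are strings beginning with "?", and mutual subsumption of two first elements is tested by checking that they are equal after variables are renamed in order of first occurrence. The sketch only groups the rules; it does not rename variables across the remainder of each rule.

```python
from collections import defaultdict


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def canonical(term, mapping=None):
    """Rename variables in order of first occurrence (?v0, ?v1, ...).

    Two terms mutually subsume each other exactly when their canonical
    forms are equal, i.e. when they differ only in variable names.
    """
    if mapping is None:
        mapping = {}
    if is_var(term):
        return mapping.setdefault(term, f"?v{len(mapping)}")
    if isinstance(term, tuple):
        return tuple(canonical(t, mapping) for t in term)
    return term


def build_rule_groups(rules):
    """Group rules whose first right-hand-side elements are equivalent."""
    groups = defaultdict(list)
    for lhs, rhs in rules:
        groups[canonical(rhs[0])].append((lhs, rhs))
    return dict(groups)


# A toy grammar fragment (hypothetical, for illustration only).
RULES = [
    (("VP", "?agr"), (("V", "?agr"), ("NP", "?x"))),
    (("VP", "?a"),   (("V", "?a"), ("PP", "?p"))),
    (("VP", "?a"),   (("V", "?a"),)),
    (("NP", "?n"),   (("DET", "?n"), ("N", "?n"))),
]
GROUPS = build_rule_groups(RULES)

# Average rules per group for the figures quoted in the text.
AVERAGE_RULES_PER_GROUP = 270 / 70
```

Here the three VP rules fall into one group keyed on the shared first element, so a single unification against that element stands in for three separate tests; 270 rules in roughly 70 groups gives the "between three and four" average quoted above.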

Note that the kind of efficiency added by this use of rule-groups closely resembles
that found in ATNs, which can represent the common prefixes of many parse paths in
a single state, with the paths only diverging when different tests or actions are required.
But because this process here occurs as a compilation step, it can be added without losing
any of the original grammar’s modularity and clarity. While similar goals could also be
achieved by rewriting the grammar, for example, by creating a new constituent type to
represent the shared element, such changes would make the grammar less perspicuous and
less easy to extend. Under our regime, the grammar writer is left free to follow linguistic
motivations in determining constituent structure and the like, while the system takes care of
restructuring the grammar to allow it to be executed more efficiently.


Limited  Disjunction 


While in some circumstances it seems best to write independent rules and allow the system
to discover the common elements, there are other cases where it is better for the grammar
writer to be able to collapse similar rules using disjunction. That explicit kind of disjunction,
of course, has the same obvious advantage of allowing a single rule to express what
otherwise would take n separate rules, which would have to be matched against separately
and which could add an additional factor of n ambiguity to all the structures built by the
parser. For example, the agreement features for a verb like “list” used to be expressed in
BBN’s DELPHI system using five separate rules:


(V (AGR (1ST) (SNG)) ...) → (list)
(V (AGR (2ND) (SNG)) ...) → (list)
(V (AGR (1ST) (PL))  ...) → (list)
(V (AGR (2ND) (PL))  ...) → (list)
(V (AGR (3RD) (PL))  ...) → (list)


That information can be expressed in a single disjunctive rule:


(V (:OR (AGR (1ST) (SNG))
        (AGR (2ND) (SNG))
        (AGR (1ST) (PL))
        (AGR (2ND) (PL))
        (AGR (3RD) (PL))) ...) → (list)


Many researchers have explored adding disjunction to unification, either for grammar
compactness or for the sake of increased efficiency at parse time. The former goal can be
met as in Definite Clause Grammars [74] by allowing disjunction in the grammar formalism
but multiplying such rules out into disjunctive normal form at parse time. However, making
use of disjunction at parse time can make the unification algorithm significantly more
complex. In Karttunen’s scheme [52] for PATR-II, the result of a unification involving a
disjunction includes “constraints” that must be carried along and tested after each further
unification, to be sure that at least one of the disjuncts is still possible, and to remove
any others that have become impossible. Kasper [54], while showing that the consistency
problem for disjunctive descriptions is NP-complete, proposed an approach whose average
complexity is controlled by separating out the disjunctive elements and postponing their
expansion or unification as long as possible.
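
To make "multiplying out into disjunctive normal form" concrete, the following Python sketch expands a term containing `:OR` into its disjunction-free variants. The tuple encoding and the function name `expand` are assumptions made for illustration; they are not taken from any of the cited systems.

```python
from itertools import product


def expand(term):
    """Return every disjunction-free variant of a (possibly nested) term."""
    if not isinstance(term, tuple):
        return [term]
    if term and term[0] == ":OR":
        variants = []
        for alternative in term[1:]:
            variants.extend(expand(alternative))
        return variants
    # Ordinary structure: take the cross product of the sub-term variants.
    return [tuple(combo) for combo in product(*(expand(t) for t in term))]


# The disjunctive agreement structure for "list", in a hypothetical encoding.
DISJUNCTIVE_V = ("V", (":OR",
                       ("AGR", "1ST", "SNG"),
                       ("AGR", "2ND", "SNG"),
                       ("AGR", "1ST", "PL"),
                       ("AGR", "2ND", "PL"),
                       ("AGR", "3RD", "PL")))

EXPANDED = expand(DISJUNCTIVE_V)
```

Expanding the single disjunctive term recovers the five separate V entries shown earlier, which illustrates why this route keeps the grammar compact but gives no saving at parse time.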


Rather than pursuing efficient techniques for handling full disjunction within
unification, we have taken quite a different tack, defining a very limited form of disjunction that
can be implemented without substantially complicating the normal unification algorithm.
The advantage of this approach is that we already know that it can be implemented without
significant loss of efficiency, but the question is whether such a limited version of
disjunction will turn out to be sufficiently powerful to encode the phenomena that seem to call
for it. Our experience seems to suggest that it is.

Much of the complexity in unifying a structure against a disjunction arises only when
more than one variable ends up being bound, so that dependencies between possible
instantiations of the variables need to be remembered. For example, the result of the following
unification


?AGR ⊓ (:OR (AGR (2ND) (SNG))
            (AGR (2ND) (PL))
            (AGR (3RD) (PL)))


(where “?” marks variables and “⊓” means unification) can easily be represented by a
substitution list that binds ?AGR to the disjunction itself, but the following case


(AGR ?P ?N) ⊓ (:OR (AGR (2ND) (SNG))
                   (AGR (2ND) (PL))
                   (AGR (3RD) (PL)))


requires that the values given to ?P and ?N in the substitution list be linked in some
way to record their interdependence. In particular, it seemed that if we never allowed
variables to occur inside a disjunction nor any structure containing more than one variable
to be matched against one, then the result of a unification would always be expressible by
a single substitution list and that any disjunctions in that substitution list would also be
only disjunctions of constants. Thus we required that disjuncts contain no variables, and
that the value matched against a disjunction either be itself a variable or else contain no
variables.
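
A minimal Python sketch of unification under this restriction is given below. It is an illustration under stated assumptions, not the DELPHI unifier: terms are nested tuples, variables are strings beginning with "?", a disjunction is a set of variable-free terms, and the names `unify_with_disjunction` and `DisjunctionError` are hypothetical.

```python
class DisjunctionError(Exception):
    """Raised when a term containing variables meets a disjunction."""


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def has_vars(term):
    if is_var(term):
        return True
    return isinstance(term, tuple) and any(has_vars(t) for t in term)


def unify_with_disjunction(term, alternatives, subst):
    """Unify `term` with a disjunction of constant terms.

    Returns a new substitution, or None on failure. `term` must be a single
    variable or a variable-free structure; anything else is an error.
    """
    alternatives = frozenset(alternatives)
    if is_var(term):
        current = subst.get(term)
        if current is None:
            remaining = alternatives                 # bind to the disjunction
        elif isinstance(current, frozenset):
            remaining = current & alternatives       # narrow earlier disjunction
        elif has_vars(current):
            raise DisjunctionError(f"{term} is bound to a non-ground term")
        else:
            remaining = alternatives & {current}     # already a constant
        if not remaining:
            return None
        new_subst = dict(subst)
        new_subst[term] = (remaining if len(remaining) > 1
                           else next(iter(remaining)))
        return new_subst
    if has_vars(term):
        raise DisjunctionError("structure with variables met a disjunction")
    return dict(subst) if term in alternatives else None


LIST_AGR = {("AGR", "2ND", "SNG"), ("AGR", "2ND", "PL"), ("AGR", "3RD", "PL")}

bound = unify_with_disjunction("?AGR", LIST_AGR, {})
narrowed = unify_with_disjunction("?AGR", {("AGR", "3RD", "PL")}, bound)
```

Because every disjunct is a constant, the result is always a single substitution whose values are constants or sets of constants; a term such as (AGR ?P ?N) is rejected outright rather than handled with linked bindings.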

However, enforcing the restriction against unifying disjunctions with multi-variable
terms turns out to be more complex than first appears. It is not sufficient to ensure, while
writing the grammar, that any element in a rule that will directly unify with a disjunctive
term be either a constant term or a single variable, since a single variable in the rule that
directly matches the disjunctive element might have already, by the operation of other
rules, become partially instantiated as a structure containing variables, and thus one that
our limited disjunction facility would not be able to handle.

For example, if the disjunctive agreement structure for “list” cited above occurs in a
clause with a subject NP whose agreement is (AGR (3RD) (PL)), and in a containing
VP rule that merely identifies the two values by binding them both to a single variable,
the conditions for our constrained disjunction are met. However, if the subject NP turns
out to be a pronoun with its agreement represented as (AGR ?P ?N), the constraint is no
longer met.

This problem with our fast but limited disjunction turned up, for example, when a
change in a clause rule caused agreement features to be unbound at the point where
disjunctive matching was encountered. The change was introduced to allow for queries such
as “Do United or American have flights ...”, where the agreement between the conjoined
subject and the verb does not follow the normal rules. The solution to that problem, as
discussed above in Section 2.2, was to introduce a constraint node [20] pseudo-constituent,
placed by convention at the end of the rule, to compute the permissible combinations of
agreement values, with the values chained from the subject through this constraint node to
the VP. Unfortunately, because of the placement of that constraint node in the rule, this
meant that the agreement features were still unbound when the VP was reached, which
caused our disjunctive unification to fail.

In our current implementation, the grammar writer who makes use of disjunction in
a rule must also ensure that the combination of that rule with the existing grammar and
the known parsing strategy will still maintain the property that any element to be unified
with a disjunctive element will be either a constant or a single variable. Mistakes result in
errors flagged during parsing when such a unification is attempted. We are not happy with
this limitation. Nevertheless, the result of our work so far has been a many-fold reduction
in the amount of structure generated by the parser without any significant increase in the
complexity of the unification itself.


4.1.2  Search  Space  Reduction 

A  second  class  of  changes  in  addition  to  those  that  fold  together  similar  structures  are 
changes  in  the  parsing  algorithm  that  reduce  the  size  of  the  search  space  that  must  be 
explored.  The  major  type  of  such  control  was  a  form  of  prediction  from  left  context  in  the 
same  sentence. 


Prediction 


A form of prediction for context free grammars was described by Graham, Harrison, and
Ruzzo [39] that was complete in the sense that, during its bottom-up, left-to-right parsing,
their algorithm never tries to build derivations at word w using rule R unless there is a
partial derivation of the left context from the beginning of the sentence up to word w − 1
that contains a partially matched rule whose next element can be the root of a tree with R
on its left frontier. This is done by computing at each position in the sentence the set of
non-terminals that can follow partial derivations of the sentence up to that point.

While this style of prediction works well for context free grammars, simple extension of
that method to complex feature-based grammars founders due to the size of the prediction
tables required, since a separate entry in the prediction tables needs to be made not just
for each major category, but also for every distinct set of feature values that occurs in
the grammar. One alternative is to do only partial prediction, ignoring some or all of
the feature values. (The equivalent for context free grammars would be predicting on the
basis of sets of non-terminals.) This reduces the size of the prediction tables at the cost of
reducing the selectivity of the predictions and thus relatively increasing the space that will
have to be searched while parsing.

The amount of selectivity available from prediction based on particular sets of feature
values depends, of course, on the structure of the particular grammar. In the BBN DELPHI
system, we found that prediction based on major category only, ignoring all feature values,
was only very weakly selective, since each major category predicted almost the full set
of possible categories. Thus, it was important to make the prediction sensitive to at least
some feature values. We achieved the same effect by splitting certain categories based on
the values of key features and on context of applicability. Prediction by categories in this
adjusted grammar did significantly reduce the search space.

For example, our original grammar used the single category symbol S to represent
matrix (or root) clauses, subordinate clauses, and relative clauses. This had the effect on
prediction that any context which could predict a sub-clause ended up predicting both S
itself and everything that could begin an S, which meant in practice almost all the
categories in the grammar, since virtually any category can appear as the initial constituent of a
matrix clause. By dividing the S category into three (ROOT-S for matrix clauses, REL-S
for relative clauses, and S for all other subordinate clauses), we were able to limit such
promiscuous prediction. Furthermore, the distinction between root and non-root clauses
is well established in the linguistic literature (see Emonds [36]), and relative clauses are
known to behave differently than other subordinate clauses with respect to various
extraction phenomena [79, 80]. Having this distinction encoded via separate category symbols,
rather than through subfeatures, allows us to more easily separate out the phenomena that
distinguish the three types of clauses. For example, root clauses express direct questions,
which are signalled by subject-aux inversion (“Is that a non-stop flight?”), while subordinate
clauses express indirect questions, which are signalled by “if” or “whether” unaccompanied
by subject-aux inversion (“I wonder if/whether that is a non-stop flight”). And relative
clauses and indirect questions use different forms of WH-pronouns; relative clauses also
allow for forms without an overt WH pronoun at all, which is impossible in any sort
of question, direct or indirect. Thus it seems that the distinction needed for predictive
precision at least in this case was also one of more general linguistic usefulness.
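
The effect of splitting S on category-only prediction can be seen in a toy example. The Python sketch below computes, for a category, the set of categories that can appear on its left frontier (its left-corner closure), which is what a category-level prediction table has to admit once that category is predicted. Both grammars are hypothetical and far smaller than DELPHI's; they are meant only to show the mechanism.

```python
def left_corners(category, rules):
    """Categories that can begin `category`, including `category` itself."""
    seen = {category}
    stack = [category]
    while stack:
        current = stack.pop()
        for lhs, rhs in rules:
            if lhs == current and rhs[0] not in seen:
                seen.add(rhs[0])
                stack.append(rhs[0])
    return seen


# One S symbol covers matrix, relative, and subordinate clauses.
UNSPLIT = [
    ("S", ["NP", "VP"]), ("S", ["AUX", "NP", "VP"]), ("S", ["PP", "S"]),
    ("S", ["VP"]), ("S", ["RELPRO", "VP"]),
    ("NP", ["DET", "N"]), ("NP", ["NP", "S"]),
    ("VP", ["V", "S"]), ("VP", ["V", "NP"]),
    ("PP", ["P", "NP"]),
]

# The same fragment with S divided into ROOT-S, REL-S, and S.
SPLIT = [
    ("ROOT-S", ["NP", "VP"]), ("ROOT-S", ["AUX", "NP", "VP"]),
    ("ROOT-S", ["PP", "ROOT-S"]), ("ROOT-S", ["VP"]),
    ("REL-S", ["RELPRO", "VP"]),
    ("S", ["NP", "VP"]), ("S", ["COMP", "NP", "VP"]),
    ("NP", ["DET", "N"]), ("NP", ["NP", "REL-S"]),
    ("VP", ["V", "S"]), ("VP", ["V", "NP"]),
    ("PP", ["P", "NP"]),
]

before = left_corners("S", UNSPLIT)   # predicted when a sub-clause is expected
after = left_corners("S", SPLIT)
```

In the unsplit fragment, predicting a subordinate clause after a verb admits nine categories, nearly the whole toy grammar; after the split only S, NP, COMP, and DET remain, because the openings specific to matrix and relative clauses are no longer reachable from S.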

We can look at the division of the former S category into ROOT-S, REL-S, and S as a
species of hand-compilation. It should be pointed out that the division of S into three
subtypes has another small effect on efficiency. Before the division, matrix, relative, and
subordinate clauses all had to carry a full range of features (mood, tense, trace, etc.), some
of which are predictable for root clauses and which need not be tested against. With the
division of S into ROOT-S, REL-S, and S, many of the former features have been removed
from ROOT-S and REL-S, requiring fewer unifications to be performed whenever a clausal
utterance or relative clause is built, which is quite often. Here are the current specifications
of the features for these three categories:


(ROOT-S COMPFLG WH CONJFLAG)
(REL-S 8CNS-P TRACE CONJFLAG)
(S COMPFLG MOOD WH TRACE TRACE CONJFLAG)


All three categories contain CONJFLAG, which indicates whether the clause is a
conjunction or not. COMPFLG indicates what complementizer introduces the clause. For ROOT-S,
this can either be null or “what if” for conditionals. For S, it can either be null or “that” for
indirect statements, or some WH pronoun or “if” or “whether” for indirect questions. This
feature must be passed up for ROOT-S, in order to derive the correct semantic
interpretation; and for S, since different verbs selecting clausal complements put different constraints
on their complements (e.g. indirect question or statement, “that” required or optional for
indirect statements, etc.). REL-S does not require this feature at all, since its introductory
material is limited to the relative complementizers “that” and null and the relative pronouns,
and is not externally selected at all. WH appears on both ROOT-S and S, since they may
be either statements or questions, but does not appear on REL-S, since it has no option.
REL-S contains a single TRACE flag, to bind the relativized position within the clause to
the head of the relative, while S has two, to provide the correct links between extracted
WH phrases and their traces via the well-known difference list [73] or “trace-threading”
[24] mechanism. ROOT-S has no need for this feature, since, by definition, there is no
place external to a matrix clause to bind a trace. Finally, S contains a MOOD feature to
allow clausal complement taking constituents to select the mood of the clause; contrast “I
think that he is arriving” with “I demand that he arrive on time”. ROOT-S and REL-S
only take the indicative form, so there is no need for them to bear this feature.

Another category which we found useful to split was VP. As was mentioned in [24],
VP in our former grammar was used to handle the following types of constituents:


•  a verb and its complements: “What time does the 7:20 AM flight arrive in Dallas?”

•  an infinitival complement: “I need to arrive in Baltimore before 9 o'clock”

•  an imperative clause: “Show flights first class on American Airlines between Dallas
and Philadelphia”

In our earlier report, we had discussed assigning different constituent labels to these
three constructions, reserving VP for a verb and its complements. During this project, we
carried this plan out, and now have the three categories:


VP for a verb and its complements.
COMPL-VP for infinitival complements.
TOP-VP for imperatives.


VP and COMPL-VP share the same set of features, while TOP-VP has a much smaller set,
since it does not need to agree in features with any dominating constituent.

We also made possessive pronouns, such as “my” and “her”, members of the category
POSSESSIVE-PRONOUN; previously, they had been NPs.

A further major gain in predictive power occurred when we made linguistic trace
information usable for prediction. The presence of a trace in the current left context was
used to control whether or not the prediction of one category would have the effect of
predicting another, with the result of avoiding the needless exploration of parse paths
involving traces in contexts where they were not available.

Like the rule-groups described earlier, prediction brings to a bottom-up unification
parser a kind of procedural efficiency that is common in other parsing formalisms, where
information from the left context cuts down the space of possibilities to be explored. Note
that this is not always an advantage; for parsing fragmentary or ill-formed input, one might
like to be able to turn prediction off and revert to a full, bottom-up parse, in order to
work with elements that are not consistent with their left context. However, it is easy
to parameterize this kind of predictive control for a unification parser, so as to benefit
from the additional speed in the typical case, but also to be able to explore the full range
of possibilities when necessary. In fact we have done just this in the implementation
of the syntactic fragment combiner discussed below in Section 4.3. When prediction
fails in fallback mode, we revert to a full, bottom-up parse, based on the constituents
found: essentially, we scan from left to right looking for elements such as prepositions
and determiners that tend to be good indicators of the left margin of a constituent. We
then proceed, returning to our use of prediction, with possible reinvocation of bottom-up
information if we cannot find any constituent(s) that span the remainder of the input while
remaining consistent with left-context.


Non-Unification Computations


We also integrated non-unification computations into the parser, where these can still
provide the order-independence and other theoretical advantages of unification. There are
some subproblems that need to be solved during parsing that do not seem well-suited to
unification, but for which there are established, efficient solutions. As was noted in Section
2.2.1, our technique for handling such problems is a mechanism that
allows us to write constraint nodes to be compiled into LISP code. This ability is crucial
to the greater part of the work we have done to change our grammar from a fairly
straightforward implementation of Definite Clause Grammar to a device for transducing between
written or spoken input. Without the ability to produce certain solutions via computation,
we would have been unable to implement the labelled argument architecture described in
Sections 2.4 and 3.4.


4.2  The  Use  of  a  Statistical  Agenda  in  Parsing 

4.2.1  The  GHR  Algorithm 


In a Graham/Harrison/Ruzzo (GHR) parser [39], such as was used in the initial
implementation of DELPHI [24], a chart is used to maintain a record of syntactic constituents that
have been found (terms) and grammatical rules that have been partially matched (dotted
rules). Parsing strategies such as GHR, CKY [53] and other algorithms can be viewed as
methodical ways of filling the chart which are guaranteed to explore all possible extensions
of dotted rules by terms.

An agenda is an alternative chart-filling algorithm with the goal of finding some term
covering the entire input without necessarily filling in all of the chart. If terms can be
ranked by “goodness” and the grammar can produce multiple analyses of a given string,
then one goal for an agenda is to produce the “best” parse first.

In  order  to  speed  up  processing  time  we  have  chosen  to  use  the  agenda  mechanism 
to  reduce  the  search  necessary  to  produce  acceptable  (see  below)  parses.  This  results  in 
sparsely  populated  charts,  approaching  the  extreme  (and  probably  unattainable)  goal  of 
deterministic  parsing,  in  which  the  only  terms  and  dotted  rules  entered  into  the  chart  are 
those  which  appear  as  parts  of  the  final  parse. 


The techniques involved in statistical agenda parsing allow “low probability” rules to be
added to a grammar without significant cost in terms of either erroneous parses or increased
parse time. These low probability rules greatly increase the coverage and robustness of the
system by accounting for unusual or marginal constructions.


4.2.2 DELPHI Agenda Parsing


Most techniques for search space reduction involve careful tuning of the grammar or the
parsing mechanism. This is very labor intensive and can place limits on the grammatical
coverage of the system [1]. Our approach is to use an automated statistical technique for
ranking rules based on their use in parsing a training set with the same grammar (under
the control of an all-paths GHR parser without human supervision).

This approach also allows us to include grammatical rules that are of use only rarely,
or in specialized domains, and to learn how applicable they are to a body of sentences.
To take into account general linguistic tendencies, we augment the statistical ranking by a
small number of general agenda ordering strategies.

The DELPHI agenda mechanism is based on three “schedulable” action types:


1. the insertion of a term into the chart,

2. the insertion of a dotted rule into the chart, and

3. the (conditional) “pair extension” of a dotted rule by a term.


In principle one would like to order those actions in terms of the probability that
they lead to a final parse. The initial implementation of the agenda mechanism uses an
approximation to this ordering.


4.2.3 Use of Statistical Measures


There are two types of measures that one might estimate to help the agenda parsing
mechanism. They are (1) category expansion probabilities and (2) rule success probabilities.


Category Expansion Probabilities


Category expansion probabilities are perhaps the more obvious of the two measures. The
goal is to determine the probability that a given syntactic category (e.g., NP) is expanded
by a given grammar rule in a valid parse.

These probabilities allow one to estimate the probability that a given tree is the
expansion of a given category. Bayes’ rule may be used to calculate the relative probabilities of
various parse trees for a specified input string.


Rule  Success  Probabilities 


Using  rule  success  probabilities,  the  goal  is  to  determine  the  probability  that  a  term  inserted 
into  the  chart  by  a  particular  rule  will  be  part  of  a  final  parse. 


Training

In  order  to  train  the  agenda  mechanism,  a  set  of  sentences  is  parsed  using  the  all-paths 
GHR  parser  and  their  charts  are  analyzed. 

For each rule (R) in the grammar we determine three numbers:


1. NT(R), the number of terms in the charts based on that rule.

2. NDR(R), the number of dotted rules initiated in the chart based on that rule.

3. NGT(R), the number of “good terms” based on that rule, ones that are constituents of
an acceptable parse (e.g., ones leading to executable database commands for ATIS).

For each category C in the grammar, we calculate one number:

4. NGT(C), the number of terms with that category which are constituents in an
acceptable parse.


The ratio NGT(R)/NT(R) is an estimate of the probability that a term based on R will appear in
the final parse, and NGT(R)/NDR(R) is an estimate of the probability that the initiation of a dotted
rule based on R will lead to a good term. (Note that in DELPHI, each word sense is
treated as if it were a separate grammar rule, and so this mechanism takes into account the
relative likelihood of various word senses in the training set.)


If C(R) is the category produced by the rule R, then the category expansion probability
for R is estimated by the ratio NGT(R)/NGT(C(R)).
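
The training estimates can be illustrated with a short Python sketch. It assumes that the category expansion probability is estimated as NGT(R)/NGT(C(R)) and that NGT(C) is the sum of NGT(R) over the rules producing C. The rule names, the counts, and the names `RuleCounts` and `estimate` are hypothetical and do not come from the DELPHI implementation.

```python
from dataclasses import dataclass


@dataclass
class RuleCounts:
    category: str  # C(R), the category the rule produces
    nt: int        # NT(R): terms in the training charts based on R
    ndr: int       # NDR(R): dotted rules initiated from R
    ngt: int       # NGT(R): "good" terms based on R


def estimate(counts):
    """Return per-rule probability estimates from training-chart counts."""
    # NGT(C): good terms per category, summed over rules producing C (assumption).
    ngt_cat = {}
    for c in counts.values():
        ngt_cat[c.category] = ngt_cat.get(c.category, 0) + c.ngt
    out = {}
    for name, c in counts.items():
        cat_total = ngt_cat[c.category]
        out[name] = {
            "term_success": c.ngt / c.nt if c.nt else 0.0,
            "dotted_rule_success": c.ngt / c.ndr if c.ndr else 0.0,
            "category_expansion": c.ngt / cat_total if cat_total else 0.0,
        }
    return out


# Made-up counts for two NP rules, for illustration only.
counts = {
    "NP -> DET N": RuleCounts("NP", nt=200, ndr=400, ngt=150),
    "NP -> NP PP": RuleCounts("NP", nt=100, ndr=500, ngt=50),
}
probs = estimate(counts)
```

Under these invented counts, the second rule starts many dotted rules that rarely lead to a good term, so an agenda ordered by these estimates would schedule it late.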


Preliminary Results for Different Measures


Using rule success probabilities leads to substantial reduction (a factor of more than 3) in
chart size. In general, one might expect that better estimates of such probabilities, based
on category expansion probabilities in the tree below the term, would lead to improved
results, even though these estimates require somewhat more computation than rule success
probabilities alone.

We have compared the use of category expansion probabilities with the use of rule
success probabilities in several variations of the agenda mechanism, and have found that
rule success probabilities produce superior results, although the reasons for this are not
entirely clear.

An experiment using category expansion probabilities alone led to larger charts than
produced by the use of rule success probabilities in isolation. Combining category
expansion probabilities with rule success probabilities appeared to be no better than just using
the rule success probabilities.


4.2.4  Agenda  Structures 


The structure of the agenda mechanism appears to be as important as the statistical measures
used to order agenda items. Experience with probabilistic agendas in speech processing
would suggest an approach in which all information relevant to ordering is combined into a
single numeric measure and used to order a single queue. In principle, this allows different
measures to interact and for strength in one measure to make up for weakness in another.

We experimented with this approach in a system which had a single agenda in which all
three of the schedulable action types described above were placed. The statistical measures
described above were combined in a weighted fashion with priorities based on the size of
the constituents, the position of the right hand end of the constituent and the action type.
A number of experiments were run, giving different weightings to the different parameters,
but all of these experiments led to charts that were 20% to 40% larger than the alternative
structured agenda described below.

The structured-agenda approach involves the creation of a 2-dimensional array of
agendas, as illustrated in Table 4.1.


                     Rightmost Entry Point
Action Type      N      N-1     ...     1      0

Pairs            A1     A4      ...
Rules            A2     A5
Terms            A3     A6

Table 4.1: Agenda


Each cell of the array consists of a single type of action, e.g. term insertion, and all of
the actions in the list Ai in a cell have the same rightmost end. Within the cell, the actions
in the list Ai are ordered by probability estimates.

For each step, the first non-empty cell (starting with A1 and going in the order shown
in Table 4.1) is chosen, and the first item on its agenda is run. This has the effect of
reinforcing progress to the right through the input string, of choosing the most appropriate
action for such motion at each step, and favoring close attachment of modifiers.
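
The cell-selection policy can be sketched in Python as follows. This is a minimal illustration, not the DELPHI code: the class name `StructuredAgenda`, the string labels for action types, and the example items are hypothetical, and the row order (pairs, then rules, then terms) is assumed from Table 4.1.

```python
import heapq
from itertools import count

# Row order within a column, assumed from Table 4.1.
ACTION_ORDER = ("pair", "rule", "term")


class StructuredAgenda:
    """A 2-D array of agendas indexed by (rightmost end, action type)."""

    def __init__(self, n_words):
        self._n = n_words
        self._tie = count()  # tie-breaker so queued items are never compared
        self._cells = {
            (end, action): []
            for end in range(n_words + 1)
            for action in ACTION_ORDER
        }

    def push(self, action, end, prob, item):
        # heapq is a min-heap, so the probability is negated.
        heapq.heappush(self._cells[(end, action)], (-prob, next(self._tie), item))

    def pop(self):
        # Scan columns from the rightmost end leftwards; within a column,
        # take pairs before rules before terms; within a cell, best first.
        for end in range(self._n, -1, -1):
            for action in ACTION_ORDER:
                cell = self._cells[(end, action)]
                if cell:
                    return heapq.heappop(cell)[2]
        return None


agenda = StructuredAgenda(n_words=3)
agenda.push("term", 1, 0.9, "NP[0,1]")
agenda.push("term", 3, 0.2, "PP[2,3]")
agenda.push("pair", 3, 0.1, "VP. + PP")
agenda.push("term", 3, 0.6, "NP[2,3]")
order = [agenda.pop() for _ in range(4)]
```

In this sketch the high-probability term ending at position 1 is run last, because every action ending further to the right takes precedence over it regardless of probability.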


4.2.5  DELPHI  Results 


Measurements of chart-size and time reductions for BBN’s DELPHI grammar running on
the ATIS0 and ATIS1 training and test sets indicate the improvements possible with several
variations of the basic agenda mechanism. For example, using the structured agenda on
the 551 ATIS0 sentences of training data from June 1990, the chart size was reduced by a
factor of 3.24, and the total processing time reduced by a factor of 1.82.

This result underestimates the improvement gained by agenda parsing, since somewhat
more than 10% of the “sentences” in the training data were ill-formed according to our
grammar (many were ill-formed according to any plausible grammar!). Since a properly
operating agenda system will eventually produce the same chart that the GHR parser does,
and since that entire chart must be searched before a string is determined to be unparseable,
the performance of any agenda mechanism must reduce to that of the GHR parser for such
inputs.


Another set of experiments was performed with a set of 539 “parseable” strings taken
from the combination of the ATIS0 (June 1990) and ATIS1 (February 1991) training sets.
For this set the speedup was a factor of 3.8 and the chart size reduction was well over 3.5.
(The hedge on chart size reduction is because data for the chart size of 5 sentences in the
GHR parser was not obtained: the charts overflowed available memory. At that point the
ratio of the GHR chart size to the size of the agenda parser chart was over 30.)

The introduction of probabilistic agenda parsing, combined with the application of
software engineering techniques, has sped up natural language analysis considerably. The
average time for parsing, semantic interpretation, and discourse processing in our original
implementation was lowered to 1.43 seconds per sentence, with a median time of 0.99
seconds, on a Sun 4/280. The most recent version of our algorithm has comparable or
even faster times, and includes not just the natural language processing stages, but is an
actual “end to end” figure including retrieval of an answer from the application database.


4.3  The  Use  of  Fragment  Processing 


In this section, we describe the fallback understanding component of DELPHI. This
component is invoked when DELPHI’s regular chart-based parser is unable to parse an input; it
attempts to come up with a parse and semantic interpretation, or a semantic interpretation
alone, based on a fragmentary analysis of the input.

The fallback understanding component consists of three separate stages, which are
invoked successively. First, the Fragment Generator produces a sequence of fragmentary
sub-parses from the chart state left over from the unsuccessful parse. Next, two different
combination modules (the Syntactic Combiner and the Frame Combiner) employ alternative
and complementary strategies for combining these fragments. This is shown in the diagram
in Figure 4.1.

The Syntactic Combiner uses extended grammar rules that can skip over intervening
material to combine constituents in an attempt to re-construct a plausible parse of the input.
This parse can be a clause or some other useful constituent such as an imperative VP. A
semantic interpretation for this reconstructed parse is automatically provided through the
action of the grammar rules.

The Frame Combiner is invoked when the Syntactic Combiner is unsuccessful. It
utilizes a set of domain dependent pragmatic slot-filling schemata that embody the goals
that users most commonly have; for example, in the ATIS domain, such tasks as finding
a flight or fare that satisfies some set of constraints, or asking about ground transportation
between an airport and a city. As such, it determines only a semantic interpretation and not
a parse.

Figure 4.1: Fallback Processing Architecture

The intent of this multi-step approach to fallback processing is to provide a smoother
path between the accuracy but fragility of regular parsing on the one hand, and the
robustness but possible inaccuracy of schemata-based methods on the other.

The remainder of this section describes each component in detail: the Fragment
Generator, the Syntactic Combiner, and the Frame Combiner. Finally, we present the
February 1992 NL and SLS evaluation test results for these components, separate and
combined.


4.3.1 The Fragment Generator

Recall that DELPHI’s grammar rules incorporate semantic constraint and interpretation
components by associating with each element of the right-hand side a grammatical relation
label which keys into an associated system of semantic rules. This feature means that any
term which is inserted into the chart is guaranteed to be semantically well-formed and to
be annotated with one or more semantic interpretations.

The fragment generator generates a set of such semantically annotated fragments from
the chart state left over after an unsuccessful parse. The algorithm for generating fragments
from the chart extracts the most probable terms associated with the longest sub-strings of the
input, using probabilities associated with the producing rules in the manner described in the
preceding section.


For  example,  the  utterance: 


I want a flight uhh that arrives in Boston let's say at 3 pm


is  conventionally  unparseable  due  to  the  interpositions  “uhh”  and  “let’s  say”.  The  Fragment 
Generator  produces  the  following  set  of  four  fragments: 


S[I want a flight]
NO-INTERP[uhh]
REL-S[that arrives in Boston]
NO-INTERP[let's say]
PP[at 3 pm]


4.3.2 The Syntactic Combiner

The Syntactic Combiner uses a special set of grammar rules, called fragment rules, to
combine these fragments into a single parse. These rules have the same form as rules of
the regular DELPHI grammar and incorporate semantic constraints and interpretation rules
in the same way. But the method for applying the fragment rules differs in that it allows
them to combine constituents even when these constituents are separated by intervening
portions of the input, or when they occur in a reversed order.

Each  fragment  rule  is  adjunction  oriented,  in  the  following  form: 


X → :head X, :other-relation C


The  following  is  an  example,  from  which  unification  features  have  been  omitted: 


VP → :head VP, :pp-comp PP


This rule says that an existing Verb Phrase fragment and an existing Prepositional Phrase
fragment can be combined together to make a new Verb Phrase with the original VP as
head and the original PP as pp complement, provided they satisfy the semantic constraints
associated with “:head” and “:pp-comp”.

The central operation of the Syntactic Combiner is adjunction. The example rule
licenses the Syntactic Combiner to “adjoin” one fragment tree into another; that is, to
replace a node of the first tree with a new node whose head daughter is the old node and
whose other daughter is the second tree. An example, using the rule above, would be the
combination of the two fragments:


PP[at  3  pm] 
VP[arrives in Boston]


to  make  the  new  VP: 


VP[arrives in Boston at 3 pm]


Note that the adjunction node does not have to be the top of the first fragment tree: it
can be any non-terminal node, as in the following pair of fragments:


PP[at 3 pm]

S[NP[I]
  VP[want
     NP[NP[a flight]
        REL-S[that
              VP[arrives in Boston]]]]]


The algorithm that applies these rules first scans right to left, taking each successive
fragment and looking for fragments to its left to adjoin the first fragment into. The search
for an attachment point within a fragment is right-to-left, bottom-up first, and deterministic.

The reason for the directional priority is to enforce the preference of fragment rules
that the sub-term of the adjunction be to the right of the head. The algorithm then reverses
direction, attempting to adjoin any remaining fragments into other fragments on their right.
It oscillates back and forth in this fashion until no more fragments can be combined.
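
The adjunction step at the heart of this algorithm can be sketched in Python. The sketch covers only a single adjunction of one fragment into one tree, with the right-to-left, bottom-up search for an attachment point; it omits the semantic constraints attached to the relation labels and the back-and-forth scan over the fragment list. The names `Node` and `adjoin` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    cat: str
    children: list = field(default_factory=list)  # Node instances or word strings

    def words(self):
        out = []
        for child in self.children:
            out.extend(child.words() if isinstance(child, Node) else [child])
        return out


def adjoin(tree, head_cat, fragment):
    """Adjoin `fragment` at a node of category `head_cat` inside `tree`.

    The search is right-to-left and bottom-up. Returns a new tree, or None
    if there is no attachment point. Semantic checks are omitted.
    """
    # Try the daughters first (bottom-up), rightmost daughter first.
    for i in range(len(tree.children) - 1, -1, -1):
        child = tree.children[i]
        if isinstance(child, Node):
            new_child = adjoin(child, head_cat, fragment)
            if new_child is not None:
                kids = tree.children[:i] + [new_child] + tree.children[i + 1:]
                return Node(tree.cat, kids)
    if tree.cat == head_cat:
        # New node: the old node is the head daughter, the fragment the other.
        return Node(head_cat, [tree, fragment])
    return None


pp = Node("PP", ["at", "3", "pm"])
s = Node("S", [
    Node("NP", ["I"]),
    Node("VP", ["want",
                Node("NP", [Node("NP", ["a", "flight"]),
                            Node("REL-S", ["that",
                                           Node("VP", ["arrives", "in", "Boston"])])])]),
])
result = adjoin(s, "VP", pp)  # applies VP → :head VP, :pp-comp PP
```

Because the search goes bottom-up from the right, the PP attaches to the embedded VP “arrives in Boston” rather than to the higher VP headed by “want”, which corresponds to the close-attachment behavior described in the text.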

At the end of this process the largest fragment (possibly now containing other fragments
which it has absorbed) is returned as the reconstructed parse, subject to cut-off restrictions
which we discuss below. More than one fragment is returned in the case of multiple clausal
fragments, and the discourse module is invoked to construct the interpretation of the whole.

As a simple example, let us return to the example of the previous section:


I want a flight uhh that arrives in Boston let's say at 3 pm


which  generates  the  fragments: 


S[I want a flight]
NO-INTERP[uhh]
REL-S[that arrives in Boston]
NO-INTERP[let's say]
PP[at 3 pm]


The rules that enable combination of these fragments are:


VP → :head VP, :pp-comp PP
NP → :head NP, :rel-clause REL-S


The first rule above licenses the attachment of “at 3 pm” to “arrives” inside the existing
REL-S structure “that arrives” and the second the attachment of the combined REL-S
structure to the NP “a flight” inside the clause “I want a flight”. After this combination,
we are left with two fragments: a clause and an unanalyzable portion of the string. Since
all grammar rules in DELPHI include a semantic interpretation component, as discussed in
Chapter 3, a semantic interpretation of the clause is also available.

The other fragment rules combine NPs and their various modifiers and VPs and their
NP complements:


NP → :head NP, :pp-comp PP
NP → :head NP, :post-nom NP
NP → :head NP, :whiz-rel VP
VP → :head VP, :direct-object NP


The Syntactic Combiner uses a cut-off (currently 0.8) for the ratio of the number of
words included in the final reconstructed parse to the number of words of the original input
to determine whether or not to accept the final analysis as plausible. The computation of
this ratio is adjusted to ignore a stop list of certain words that carry little meaning (“does”,
“me”, “could”, etc.) and to block interpretations which exclude other words which do tend
to change the meaning (“first”, “most”, etc.).
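
A minimal Python sketch of this acceptance test follows. The two word lists contain only the examples named in the text; the full lists, the function name `accept_parse`, and the exact way the ratio is adjusted are assumptions made for illustration.

```python
from collections import Counter

STOP_WORDS = {"does", "me", "could"}  # examples from the text; real list is longer
MUST_KEEP = {"first", "most"}         # words whose omission blocks the parse
CUTOFF = 0.8


def accept_parse(input_words, parse_words, cutoff=CUTOFF):
    """Decide whether a reconstructed parse covers enough of the input."""
    remaining = Counter(w.lower() for w in parse_words)
    total = kept = 0
    for word in input_words:
        w = word.lower()
        included = remaining[w] > 0
        if included:
            remaining[w] -= 1
        if w in MUST_KEEP and not included:
            return False  # dropping this word would change the meaning
        if w in STOP_WORDS:
            continue      # stop words do not count towards the ratio
        total += 1
        kept += included
    return total > 0 and kept / total >= cutoff


utterance = "could you show me the first flight uhh to Boston".split()
ok = accept_parse(utterance, "could you show me the first flight to Boston".split())
dropped_first = accept_parse(utterance, "could you show me the flight to Boston".split())
too_sparse = accept_parse(utterance, "show me the first flight".split())
```

In the first call only “uhh” is skipped, so 7 of the 8 counted words are covered and the parse is accepted. The second call is rejected outright because “first” is excluded, and the third falls below the cut-off.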


4.3.3 The Frame Combiner


The Frame Combiner seeks to combine together not fragments but the semantic
interpretations of fragments, and does so based not on grammar rules but on pragmatic schemata
which have various “slots” to fill. It works primarily with semantic interpretations of
fragments at the NP and PP level. Its approach is similar in spirit to SRI’s Template Matcher
[47] but it differs from that work in a number of important ways.

Most importantly, it is fully integrated with a conventional NLU system including
grammar and parser. This makes it possible for it to work from recursive tree fragment
structures instead of sub-strings of input. As a result, the slot-filling process is not limited to
simple phrases such as “to BWI” but can also handle more syntactically and semantically
complex phrases such as “to the airport closest to Washington DC”. All the complex
modifier structure internal to NPs which a conventional parser normally uncovers can be
incorporated into slot-filling.

Moreover, while the system does not use larger constituents such as VPs and clauses
to fill slots directly, it does make use of a candidate NP or PP’s occurrence inside such
a larger phrase to determine which slot the candidate should fill. This enables the Frame
Combiner to cope with such cases as the PP “before 3 pm”, which means entirely different
things, and therefore constrains entirely different slots, depending on whether it modifies
the verb “arrive” or “depart”.

A final difference is that the Frame Combiner attempts to determine the actual items
of information that the user wants to have presented to him; that is, which slots in the
frame are being asked about, as opposed to filled or constrained. This last has practical
importance within the context of the ATIS task domain because it enables only what is
asked about to be displayed to the user. Formerly it was sufficient simply to provide
the entire extension of a suitable frame as the answer, but given the MIN/MAX scoring
procedure, such a tactic is likely to result in numerous wrong answers.

The basic operation of the Frame Combiner is to input a sequence of semantically
annotated fragment trees and to output a logical form as a proposed interpretation of the
utterance. As intermediate steps it generates alternative sets of attribute-value “triples” and
filters these according to plausibility criteria before generating a final interpretation from
the chosen set. We next describe each of these steps.


Representational Triples


As intermediate output, the frame combiner first produces a set of attribute-value triples
with the following structure:


<OPERATOR ATTRIBUTE VALUE>


The ATTRIBUTE is a single or multi-valued function. The VALUE is an element or set
of elements from this function’s range. The OPERATOR is a binary relation over elements
of the range. In the following example:


<EQUAL ORIGIN-CITY BOSTON>


the operator is the relation EQUAL. The attribute in this example is the function
ORIGIN-CITY, whose domain is the class FLIGHT and whose range is the class CITY. The
value in the example is the individual city BOSTON.

Other typical operators are relations like TIME-BEFORE and GREATER-THAN. There
is a special operator, HAS-PROPERTY, which is combined with a truth-valued (i.e.
one-place predicate) attribute and the value TRUE for adjectival meanings like “non-stop”.

Currently there are three classes which can serve as the domain of an attribute:
FLIGHT, FARE and GROUND-TRANSPORTATION. We refer to these as the “core” classes
of the ATIS task. These core classes are associated, respectively, with the distinguished
attributes FLIGHT-OF, FARE-OF and TRANS-OF, which we term the “explicit” attribute
of the core class. Explicit attributes are necessary to incorporate well-formed, parsed
NP fragments whose semantic type is one of the core classes, such as “the USAir flight
from Boston to Denver”, without having to break them up into their component modifiers.
Explicit attributes are always combined with the EQUAL operator and an element of the
domain, and effectively correspond to the identity function for the domain.

An attribute-value triple can be formally viewed as a specification of a subset of the
domain of the attribute of the triple. While such triples have a clear relationship to the notion of
a template or frame, they are perhaps better thought of as disembodied “slot-constraints”.
Note in particular that a set of such triples is a more flexible representation than a single
template in that it can uniformly combine triples whose attributes have different domains.
This is important when the question itself concerns more than one domain, such as both
FLIGHTs and FAREs.


Generating Triples

Triples are produced from fragment trees using a recursive-descent algorithm that applies a
set of pattern rules that match against fragment trees and their attached semantic
interpretations. Rules can produce disjunctions of triples in case of ambiguity. The rules primarily
match against NP and PP constituents, associating the semantic interpretation of the NP
constituents with the value element of a triple. The algorithm mainly recurses through
other types of constituents, though it does note and pass down certain items of information
associated with them, such as the head-predicate of a VP.

Rules  consist  of  a  syntactic  pattern  component  followed  by  optional  extra  constraints 
and  an  attribute  assignment  component.  For  example  the  rule: 


(PP :PP FROM :OBJECT)
→
(SORT :OBJECT CITY)
(RESTRICT-SLOT EQUAL (:OR ORIGIN-CITY TRANS-TO-CITY) :OBJECT)


applies to PPs whose preposition is “from”. It requires that the NP object of the PP be
of the semantic class CITY. It restricts either the ORIGIN-CITY or TRANS-TO-CITY
attributes to be EQUAL to the semantics of the NP object. When applied to the fragment:


[PP from
    [NP boston]]


it  generates  the  following  two  triple  alternatives: 


<EQUAL ORIGIN-CITY BOSTON>
<EQUAL TRANS-TO-CITY BOSTON>

corresponding  to  the  two  alternatives  possible  in  a  situation  where  “from  Boston”  is  uttered: 
either  the  user  wants  to  fly  from  Boston  to  some  different  city  or  he  wants  to  get  from 
Boston  to  its  airport. 
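
The behavior of this rule can be mimicked with a short Python sketch. The `Triple` type, the `SORTS` table standing in for the semantic class check, and the function `from_pp_rule` are hypothetical; the actual rules are pattern rules matched against fragment trees, not Python functions.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    operator: str
    attribute: str
    value: str


# Hypothetical sort table standing in for the (SORT :OBJECT CITY) check.
SORTS = {"BOSTON": "CITY", "DENVER": "CITY", "3PM": "TIME-OF-DAY"}


def from_pp_rule(prep, obj):
    """Sketch of the rule for PPs headed by "from" with a CITY object.

    Returns a list of alternative triples (a disjunction), or an empty
    list when the rule does not apply.
    """
    if prep != "from" or SORTS.get(obj) != "CITY":
        return []
    # (:OR ORIGIN-CITY TRANS-TO-CITY) yields one alternative per attribute.
    return [Triple("EQUAL", attr, obj) for attr in ("ORIGIN-CITY", "TRANS-TO-CITY")]


alternatives = from_pp_rule("from", "BOSTON")
```

Keeping both alternatives at this stage defers the choice between the flight reading and the ground-transportation reading to the later filtering step.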

Rules have a slightly more complicated form when they involve an important feature of
the Frame Combiner’s triple-generation process: its use of syntactic structure and context.
For example, in the ATIS domain the PP “at 3 pm” means something very different when
attached to a verb like “arrive” than when attached to a verb like “depart”. This phenomenon
tends to pose a problem for conventional non-integrated template matching systems, as has
been noted in earlier work [47].


In the Frame Combiner this is handled by passing down the predicate representing a
verb’s meaning as an extra argument to the recursive descent algorithm. If a constituent
was attached to a VP with a particular meaning, the slot-filling process knows this when
it reaches the constituent. Slot-filling rules can be written in such a way as to behave
differently depending on whether the constituent under consideration is in the context of a
particular verbal predicate.

For example, in order to deal with the above phenomenon, the following rule applies
to PP fragments where the NP :OBJECT is of type TIME-OF-DAY, and :PREP is any
preposition from which a temporal relation can be derived. This temporal relation restricts
whatever slot is determined appropriate by the ATTRIBUTES component of the rule:


(PP :PP :PREP :OBJECT)
→
(SORT :OBJECT TIME-OF-DAY)
(TEMPORAL-RELATION :PREP :REL)
(RESTRICT-SLOT
  :REL
  (ATTRIBUTES
    (CONTEXT ARRIVE ARRIVAL-TIME)
    (DEFAULT DEPARTURE-TIME)
    (GENERAL ARRIVAL-TIME DEPARTURE-TIME))
  :OBJECT)


The ATTRIBUTES expression delivers zero or more attributes as a disjunction of the
specific attributes depending upon which of its evidence clauses is satisfied. CONTEXT
evidence is the strongest. NON-LOCAL evidence is next, and it includes situations where a
particular verb is merely present elsewhere in the input, without dominating the constituent.
DEFAULT evidence is the assignment preferred, whereas GENERAL evidence is all the
assignments allowed.
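
One possible reading of this evidence ordering is sketched below in Python. The text does not say exactly when GENERAL is used in place of DEFAULT, so the `use_default` switch is an assumption, as are the function name `resolve_attributes` and the tuple encoding of the clauses.

```python
def resolve_attributes(clauses, context_pred=None, input_preds=(), use_default=True):
    """Pick attributes from evidence clauses, strongest evidence first.

    clauses:      tuples such as ("CONTEXT", "ARRIVE", "ARRIVAL-TIME"),
                  ("DEFAULT", "DEPARTURE-TIME") or ("GENERAL", ...).
    context_pred: predicate of the VP dominating the constituent, if any.
    input_preds:  predicates of verbs found anywhere in the input.
    use_default:  assumed switch between the preferred assignment (DEFAULT)
                  and all allowed assignments (GENERAL).
    """
    # CONTEXT: the constituent is attached under a VP with this predicate.
    for kind, *rest in clauses:
        if kind == "CONTEXT" and rest[0] == context_pred:
            return rest[1:]
    # NON-LOCAL: the verb merely occurs somewhere else in the input.
    for kind, *rest in clauses:
        if kind == "NON-LOCAL" and rest[0] in input_preds:
            return rest[1:]
    wanted = "DEFAULT" if use_default else "GENERAL"
    for kind, *rest in clauses:
        if kind == wanted:
            return rest
    return []


# Evidence clauses of the TIME-OF-DAY rule shown above.
TIME_RULE = [
    ("CONTEXT", "ARRIVE", "ARRIVAL-TIME"),
    ("DEFAULT", "DEPARTURE-TIME"),
    ("GENERAL", "ARRIVAL-TIME", "DEPARTURE-TIME"),
]
under_arrive = resolve_attributes(TIME_RULE, context_pred="ARRIVE")
no_context = resolve_attributes(TIME_RULE)
all_allowed = resolve_attributes(TIME_RULE, use_default=False)
```

With this rule, “at 3 pm” under “arrive” restricts ARRIVAL-TIME, while the same PP with no verbal context falls back to DEPARTURE-TIME.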


Filtering  Sets  of  Triples 


When all fragments have been analyzed through recursive descent, the system takes the
Cartesian product of all disjunctive interpretations to obtain the set of all alternative sets
of triples. These are then filtered to leave only the most plausible sets of triples.

There are several criteria for plausibility. The most obvious is that two or more triples
on the same attribute not specify contradictory values for the attribute. Another is that a
set not contain any two triples with clashing attribute domains. For example, in the ATIS
task one never sees queries that combine flights and ground transportation (even though
such are certainly expressible, e.g. “Show me USAir flights to airports that have limousine
service”). Thus FLIGHT and GROUND-TRANSPORTATION are clashing domains. On the
other hand, queries concerning both flights and fares do frequently occur (“Show flights to
Boston and their fares”) so FLIGHT and FARE are not clashing domains.

Another criterion is that the set of triples have the commonly seen linguistic form for the
domain. Thus, while “the airport” and “the city” are plausible fillers for TRANS-TO-AIRPORT
and TRANS-TO-CITY in the GROUND-TRANSPORTATION domain, they are much less
plausible fillers for FLIGHT domain attributes, simply because proper noun fillers are far
more common for these.

Criteria such as non-clashing domains are hard criteria, and therefore any triple set
which violates them is discarded. Other criteria, like the plausibility of linguistic form
for the domain, are softer, and the system merely prefers not to violate them.

If there is more than one plausible set of triples, the Frame Combiner will, depending on
switch setting, either give up or appeal to extrasentential discourse to resolve the ambiguity
(much as the core DELPHI system does).
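
The product-then-filter step described in this subsection can be sketched as follows. The sketch assumes a triple is a `(domain, attribute, value)` tuple and that soft criteria are expressed as a numeric penalty; the attribute names in the usage note, the function names, and the penalty interface are hypothetical and are not taken from the Frame Combiner itself.

```python
from itertools import product

# Domain pairs that never co-occur in the task (a hard criterion).
CLASHING_DOMAINS = [{"FLIGHT", "GROUND-TRANSPORTATION"}]


def violates_hard_criteria(triples):
    """True if two triples contradict on one attribute or mix clashing domains."""
    values = {}
    for domain, attribute, value in triples:
        if values.setdefault((domain, attribute), value) != value:
            return True
    domains = {domain for domain, _, _ in triples}
    return any(clash <= domains for clash in CLASHING_DOMAINS)


def plausible_triple_sets(fragment_alternatives, soft_penalty=lambda triples: 0):
    """Cross the per-fragment alternatives, drop hard violations, and keep the
    candidate sets with the lowest soft penalty."""
    candidates = []
    for combination in product(*fragment_alternatives):
        triples = [t for interpretation in combination for t in interpretation]
        if not violates_hard_criteria(triples):
            candidates.append(triples)
    if not candidates:
        return []
    best = min(soft_penalty(c) for c in candidates)
    return [c for c in candidates if soft_penalty(c) == best]
```

For instance, if one fragment yields only a FLIGHT triple and a second fragment is ambiguous between a FLIGHT reading and a GROUND-TRANSPORTATION reading, only the all-FLIGHT combination survives the hard filter. If more than one set is returned, the ambiguity remains to be handled as described above.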


Choosing  the  Information  to  Display 


At each turn in dialogue, any system performing an information retrieval task, such as
ATIS, is essentially required to display a set of objects. This holds for WH questions
(“which flights...”), imperatives (“show me”), and existential yes-no questions (“are there
any flights...”). On this perspective, the different sets of objects and relationships between
them are one part of the meaning of the query, and are represented by the sets of triples.
The other part of the meaning is the question of which of these sets to display. We refer
to this as the “topic” of the query.

To  choose  one  (or  more)  of  the  triples  as  the  topic  means  to  display  its  value  set,  as  it 
relates  to  all  other  value  sets  of  the  other  triples.  Several  different  heuristics  are  used,  and 
are  ranked  in  priority.  Each  is  tried  in  succession  until  a  topic  is  chosen. 

Most obvious is whether the filler of the triple is a WH noun phrase. If it is, it definitely
must be the topic.

Next are any “priority” domains that are not normally used merely to constrain other
sets. An example is GROUND-TRANSPORTATION: the typical ATIS user does not ask
to see cities that have a particular type of ground transportation; the user wants to see the
ground transportation itself.

“Unconstrained” triples are another likely topic. A triple is “unconstrained” if its filler
is a bare common nominal, such as “airline”, and its attribute is a total function. Since
every FLIGHT has an AIRLINE, the user is most unlikely to be imposing the vacuous
constraint that the flight is on some airline (even though this is again expressible). Rather,
the user is much more likely to be interested in seeing the airline of the flight.
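
The prioritized heuristics above lend themselves to a simple "first heuristic that fires wins" loop. The following is a minimal sketch under assumed data structures: triples are dictionaries with hypothetical `domain`, `attribute`, and `filler_kind` keys, and the sets `PRIORITY_DOMAINS` and `TOTAL_FUNCTIONS` contain only the examples mentioned in the text.

```python
PRIORITY_DOMAINS = {"GROUND-TRANSPORTATION"}
TOTAL_FUNCTIONS = {("FLIGHT", "AIRLINE")}  # every FLIGHT has an AIRLINE


def is_wh(triple):
    """The filler is a WH noun phrase."""
    return triple["filler_kind"] == "wh"


def in_priority_domain(triple):
    """The triple belongs to a domain that is displayed rather than used as a constraint."""
    return triple["domain"] in PRIORITY_DOMAINS


def is_unconstrained(triple):
    """Bare common nominal filling a total-function attribute."""
    return (triple["filler_kind"] == "bare-nominal"
            and (triple["domain"], triple["attribute"]) in TOTAL_FUNCTIONS)


# Heuristics in priority order; each is tried until one selects a topic.
HEURISTICS = [is_wh, in_priority_domain, is_unconstrained]


def choose_topic(triples):
    """Return the triples chosen as topic by the first heuristic that fires."""
    for heuristic in HEURISTICS:
        chosen = [t for t in triples if heuristic(t)]
        if chosen:
            return chosen
    return []
```

With a destination triple filled by a proper noun and an AIRLINE triple filled by the bare nominal “airline”, the sketch selects the AIRLINE triple; adding a WH-filled triple would take precedence over both.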


Generating  the  Final  Interpretation 

The Frame Combiner generates a final logical form from a chosen set of triples by first
associating a variable with each triple filler (“value” slot) and a variable with each of
the core classes present in the set, whether through explicit attributes on the class or
implicitly as the domain of another attribute. It generates a matrix formula in which all
the attributes present are binary relations and the generated variables are the arguments
to these binary relations. Quantificational structure, corresponding to the fillers of triples,
is then generated. The quantifiers for topic triples are treated as though they were WH
quantifiers, and appropriate display commands generated.
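
A rough sketch of the variable assignment and matrix construction follows. It covers only the simplest case (one variable per class and one per filler, with each attribute as a binary relation) and leaves out quantifier generation. The `(domain, attribute, filler)` triple encoding and the name `build_matrix` are assumptions made for illustration.

```python
from itertools import count


def build_matrix(triples):
    """Assign variables and return (class_vars, filler_vars, conjuncts).

    Each conjunct is (attribute, class_variable, filler_variable), i.e. the
    attribute read as a binary relation over the two generated variables.
    """
    fresh = count(1)
    class_vars = {}    # one variable per core class present
    filler_vars = []   # (variable, filler) pairs, later quantified
    conjuncts = []
    for domain, attribute, filler in triples:
        if domain not in class_vars:
            class_vars[domain] = f"x{next(fresh)}"
        value_var = f"x{next(fresh)}"
        filler_vars.append((value_var, filler))
        conjuncts.append((attribute, class_vars[domain], value_var))
    return class_vars, filler_vars, conjuncts
```

The conjuncts form the matrix; a later pass would wrap it in quantifiers derived from the fillers, treating topic fillers as WH quantifiers.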


4.3.4  Results  and  Discussion 


As an attempt to measure the effects of these different fallback strategies, we ran a number
of tests using the February 1991 cross-site evaluation test data. Using the same constant
executable Lisp image (“disksave”) run for the official results, the test was run using a
number of different switch settings, and scored with the version of the NIST comparator
used for the official results. The switch conditions were: no fallback processing at all
(which is simply the core DELPHI system), Syntactic Combiner only, Frame Combiner
only, and both Syntactic Combiner and Frame Combiner working together (which was the
condition used in the official results). The figures for NL only are reported in Table 4.2.

Note that the frame-only condition is actually better than the result officially reported, in
which both fallback sub-components were used.
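
As a reading aid, the %WE column of Table 4.2 is consistent with the weighted error used in the DARPA ATIS evaluations, in which a false answer is penalized twice as heavily as no answer: %WE = 2 × %F + %NA. The formula is not stated in this section, so the check below should be taken as an interpretation of the table rather than as part of the reported method; the function name is hypothetical.

```python
def weighted_error(pct_false, pct_no_answer):
    """Weighted error in percent: a false answer counts twice, no answer once."""
    return 2 * pct_false + pct_no_answer


# (%F, %NA, reported %WE) for each condition in Table 4.2.
TABLE_4_2 = {
    "no fallback": (7.4, 23.3, 38.1),
    "syn only": (8.9, 18.0, 35.8),
    "frame only": (9.6, 12.1, 31.3),
    "both (official)": (10.6, 12.7, 33.9),
}

for condition, (pct_f, pct_na, reported) in TABLE_4_2.items():
    # Allow for rounding to one decimal place in the reported figures.
    assert abs(weighted_error(pct_f, pct_na) - reported) < 0.05, condition
```

Read this way, the frame-only condition wins because its large reduction in unanswered queries outweighs its slightly higher rate of false answers.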

                   %T     %F     %NA    %WE
no fallback        69.5    7.4   23.3   38.1
syn only           73.1    8.9   18.0   35.8
frame only         78.3    9.6   12.1   31.3
both (official)    76.7   10.6   12.7   33.9

Table 4.2: NL Results, February, 1992


For the SLS test, the output of BBN’s BYBLOS N-best recognizer was used, with N = 5.
The core DELPHI system (without fragments) was first tested against the five theories.
If an interpretation was found for one of them, it was returned. Otherwise, the fallback
methods were applied.
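
The control flow of this SLS procedure can be sketched as below. The two callables stand in for the core DELPHI system and the fragment-based fallback. Their names and signatures are hypothetical, and passing the whole N-best list to the fallback is an assumption, since the text does not say which theories the fallback methods were applied to.

```python
def interpret_nbest(theories, core_interpret, fallback_interpret):
    """Try the core system on each recognizer theory in rank order and
    return the first interpretation found; otherwise use the fallback."""
    for theory in theories:
        interpretation = core_interpret(theory)
        if interpretation is not None:
            return interpretation
    # No theory received a full interpretation: apply fragment processing.
    return fallback_interpret(theories)
```

A design consequence of this ordering is that a lower-ranked theory with a complete parse is preferred over a fragment-based reading of the top-ranked theory.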

Results for three of the four conditions are seen in Table 4.3 (results for the
no-fallback = core DELPHI condition were unavailable as of this writing). The figure for the
combination of both fragment modules (the configuration used in the official test) reflects a
slight downward adjustment from the original value of 43.7 that corrects a purely procedural
error committed during our running the test (the file that specifies “today’s date” for each
query was not loaded, leading to a small increase in the number of wrong answers). This
problem was fixed in obtaining the results in Table 4.3.


                   %T     %F     %NA    %WE
syn only           --     --     --     --
frame only         --     --     --     --
both (official)    71.8   15.4   12.8   43.7
both (adjusted)    71.9   15.1   13.0   43.2

Table 4.3: SLS Results, February, 1992


As in the regular NL test, the SLS results show a noticeable improvement over the
official results when the Frame Combiner is used alone.

These results tend to undercut a central premise of our original strategy: namely that
using both fragment combination methods together would improve the result over the use
of either alone. Our tentative hypothesis is that the Syntactic Combiner, when failing
and passing to the Frame Combiner the best results of its combination attempt, is passing
wrongly combined fragments which mislead the Frame Combiner.

On the other hand, these results do show the utility of the Frame Combiner when used
alone. For NL only, it reduced the No Answer rate by 11.2 percentage points, and Weighted
Error by 4.2 percentage points. For SLS, it reduced the Weighted Error from the adjusted
official value of 43.2% to a new low of 39.2%.


Chapter  5 


Discourse 


During the course of this project, we implemented the first discourse component for
DELPHI. In this chapter, we discuss the current discourse module of DELPHI (Section 5.1),
which is designed for the ATIS domain, whose special discourse requirements are detailed.
We also discuss the initial implementation in DELPHI of domain independent discourse
techniques, based on earlier work done at BBN for other DARPA contracts [8] (Section
5.2).


5.1  Current  Discourse  Module 


5.1.1 Discourse Phenomena in the ATIS Domain

We  begin  our  discussion  of  DELPHI’S  current  discourse  module  with  an  overview  of  the 
type  of  phenomena  found  in  the  ATIS  common  task  domain. 

While the first tests of natural language capability in the DARPA SLS program stressed
the ability to understand single utterances, later tests have taken into account the
requirement that a natural language system used for the ATIS task is intended to facilitate
conversations involving more than a single utterance. The speaker typically builds up a
travel plan over the course of several interactions with the system, and the information
already given for the plan is assumed and is usually not repeated nor explicitly referred
to by the user in later utterances. This implies that natural language systems for problem
solving tasks such as ATIS must be sensitive to the context of the task and the conversation,
in order to interpret sentences as the user meant them.

The following sequences of queries from actual problem solving sessions from the
ATIS2 corpus indicate some of the phenomena that must be covered. DELPHI currently
handles nearly all of these examples correctly. Numbers in parentheses are the official
utterance IDs assigned by NIST.

The first session shows a type of anaphoric reference that is quite standard in the
linguistic literature: “that flight” (in e70083sx) refers back to the flight mentioned in the
preceding sentence (e70073sx).


(e70073sx) What’s the latest flight out of Denver that arrives in Pittsburgh next Monday?
(e70083sx) What’s the economy class fare for that flight?


The second session shows a number of phenomena. The first sentence (0h0016vx) is
literally a description of the user’s goals, but the system must interpret it as a command to
display the acceptable flights. The second sentence (0h0026vx) is clearly an elliptical
utterance, intended to add a constraint to the previous sentence. The third sentence (0h0036vx)
shows an interesting and frequent characteristic of the ATIS domain. What looks like a
clear definite reference, “flight number seven thirty one”, must be treated in the context of
the dialogue, since there are often several flights with the same number. In fact, this must
be taken in the context of the answer to the previous query, and not the text of the previous
queries alone. This is particularly true since connecting flights are represented with strings
like “US258/US424” and users refer to such flights with varied expressions. Among those
we have seen are:


flight  two  fifty  eight 
US  flight  two  five  eight 
US  Air  two  five  eight  slash  four  two  four 
US  two  five  eight  four  two  four 


Note that “US flight two five eight” may refer to the connection US258/US424, or to the
first flight in the connection, and that a single flight such as US258 can go from one city
to a second and on to a third, and thus the reference can be intended to any of the legs or
the entire flight with that identifier.
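
To make the ambiguity concrete, the following sketch matches a spoken flight number against the flight identifiers shown in the answer to the previous query. It is an illustration under simplifying assumptions, not DELPHI's mechanism: identifiers are assumed to be an alphabetic airline code followed by digits, with connection legs separated by “/”, and the function name is hypothetical.

```python
def resolve_flight_reference(number, displayed_ids, airline=None):
    """Return the displayed flight ids that have a leg matching the spoken
    flight number (and the airline code, when the user supplied one)."""
    matches = []
    for flight_id in displayed_ids:
        for leg in flight_id.split("/"):
            leg_airline = leg.rstrip("0123456789")  # e.g. "US" from "US258"
            leg_number = leg[len(leg_airline):]     # e.g. "258"
            if leg_number == str(number) and airline in (None, leg_airline):
                matches.append(flight_id)
                break
    return matches
```

With the displayed answers `["US258/US424", "AA731", "US731"]`, “US flight two five eight” selects the connection `US258/US424`, while “flight number seven thirty one” remains ambiguous between two displayed flights and would need further context to resolve.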


(0h0016vx) I would like to fly from Boston to Philadelphia next Thursday.
(0h0026vx) Between three and four PM.
(0h0036vx) What type of aircraft is flight number seven thirty one?


The third session shows a deictic reference to “today” (e70011sx), which must be
resolved by the system based on the reported date of the interaction. The second sentence
(e70021sx) shows “one” used anaphorically, and the third (e70031sx) shows a deictic
anaphor which can be interpreted in terms of either the previous query or its answer.


(e70011sx) Give me the flights from Boston to San Francisco leaving early today.
(e70021sx) Show me the one that arrives the earliest in San Francisco.
(e70031sx) What kind of plane is that?


The fourth session shows the type of “constraint extension” dialogue that is common in
this domain. The second utterance (ls0023vx) is certainly a sentential (clausal) utterance
(perhaps missing an origin or destination phrase), not a Noun Phrase or a Prepositional
Phrase. It must be taken as a further constraint on the trip plan that the user has indirectly
mentioned in the first query. The seemingly “obvious” pronominal reference “it” in the
third sentence (ls0033vx) does not refer to any Noun Phrase in the preceding sentences,
but rather to the flight implicitly specified by the second sentence. A clearer case of
elliptical specification of further constraints is given in the fifth utterance (ls0053vx).
More interestingly, the fourth utterance (ls0043vx) resets the assumed constraints given
previously; otherwise the fifth utterance could not be interpreted correctly as specifying a
flight from Boston to Atlanta (not the reverse) leaving (this is the default assumption for
times in these tasks, judging by users’ actions) at 8:24 PM.


(ls0013vx) Do you have a flight from Atlanta to Boston?
(ls0023vx) I would like to go at six thirty six AM.
(ls0033vx) Where does it stop?
(ls0043vx) I would like to fly from Boston to Atlanta.
(ls0053vx) Eight twenty four PM.


The fifth session shows more constraint extension. It raises the question as to when
a constraint context is reset. In most cases when the user explicitly mentions an origin
and destination, all previous constraints are likely to be dropped. In the last utterance
(lr0043vx), it is unclear whether the date constraint “(On) august thirty first” (from
lr0023vx) should still be in force.


(lr0013vx) Okay, I’d like to fly from Denver to Pittsburgh.
(lr0023vx) The date will be August thirty first.
(lr0033vx) I’d like to go in the morning.
(lr0043vx) Is there a flight from Denver to Pittsburgh around eight AM?


The last session illustrates that a discourse in the ATIS domain can switch between
topics: here, from flight planning to ground transportation planning and back again.


(ew0013ss) What flights leave from Boston to Pittsburgh in the morning?
(ew0023ss) Is there anything that arrives before eight fifty seven?
(ew0033ss) How about on Sunday night?
(ew0043ss) How do I get from Pittsburgh airport to downtown Pittsburgh?
(ew0053ss) What are the rates for the ground transportation?
(ew0063ss) Are there any flights that arrive before nine PM on Sunday?


5.1.2 A Discourse Module for ATIS


Most previous work on discourse context for natural language interfaces has focused on
recovering meaning from context in the face of explicit linguistic cues in an utterance.
Such cues include pronouns and definite references (e.g. “Which of them stop in Dallas”,
“Do any of those flights serve breakfast”), as well as clearly elliptical utterances (e.g.
“on Sunday”, “morning flights”, “round trip”). The DELPHI system includes a general
domain-independent mechanism for dealing with such phenomena.

The initial version of the Discourse Module utilized the notion of “discourse entities”,
discussed in more detail in Section 5.2 below, which represent those entities which are
introduced either explicitly or implicitly in previous utterances, and which can be referred
to by pronouns and definite noun phrases. It included mechanisms for resolving references
to those entities, including use of syntactic constraints to limit possible intra-sentential
referents, and a focusing model for dialogue state to limit and direct references to entities
in past utterances. This work is based on earlier work done at BBN in other DARPA
contracts [8].

Later work focused on handling the implicit discourse phenomena such as constraint
extension that seem to be critical in the ATIS domain. This was done in a separate
component of the system we refer to as the Task Tracker.

In the last year of the contract, it became clear that linguistic issues were only part of
the problem in end-to-end evaluations, and that a noticeable loss of performance occurred
in mapping from the semantic representation produced by the semantic interpreter to
that used by the database retrieval component. Thus, as part of our move to the use of
generalized grammatical relations (Section 3.5), which involved a major change to the
semantic interpreter, we decided to implement a more robust quantification module that
produced semantic representations of the same type as used by the back-end. Unfortunately,
this meant that the implementation of the Discourse Module (described in the next section),
which utilized our previous form of semantic representations, could no longer be used, since
the discourse entity code was no longer functional. We believe, however, that the theory
and algorithms still are the appropriate basis for a more general discourse component.

An  analysis  of  the  training  data  convinced  us  that  it  would  be  more  effective  and  efficient 
to  implement  a  much  simpler  interim  pronominal  mechanism,  and  focus  on  the  semantic 
interpreter  and  the  task  tracker,  with  the  intent  of  reimplementing  the  full  discourse  entity 
and  focusing  model  once  the  remainder  of  the  system  had  stabilized.  The  remainder  of 
this  section  describes  the  stages  of  implementation  of  the  Discourse  Module,  including  the 
discourse entity/focusing/communicative act Discourse Module, which has been temporarily
replaced  in  the  final  system  produced  under  this  contract  (but  which  will  be  reintroduced 
in  later  systems). 


5.2  Initial  Implementation 


In order to treat multi-sentence discourse phenomena, we ported the discourse module used
in the Janus multi-modal interface system to DELPHI [8]. This component maintained a
representation of discourse state composed of “communicative act” structures representing
each user input, machine action (response), or other communicative occurrence. Things
that could be referred to are encoded as discourse entities, elements created at the level
of discourse representation which are hypothesized on the basis of syntactic information
without being limited to syntactic constituents [96] [97]. For example, consider the fragment
of a dialogue:


Q1: What flights depart from Boston for San Francisco after 7 PM?
A1: ...
Q2: Show me their fares.


Note that “their” in Q2 refers to “flights from Boston to San Francisco after 7 PM” even
though there is no single syntactic constituent of this form in Q1. However, a corresponding
discourse entity containing just this information is produced on the basis of the information
in Q1.

In order to eliminate spurious candidates as the antecedents of pronouns, we use
syntactic constraints on intra-sentential anaphora, such as those described in [46] and [24],
as part of our anaphora resolution mechanism. Centering algorithms [27] [41] [89] [90] are
used to track focus for extra-sentential anaphora. Porting to DELPHI involved interfacing
the discourse component to DELPHI’s meaning representation and parse structures. A new
component for handling referent ambiguity was added which presented the options to the
user.

After the initial port, we added several new capabilities to our discourse module. Among
them was a facility for head-noun and noun-phrase ellipses. An example of the latter is
the following dialogue:


Q1: What airlines fly to Washington?
A1: ...
Q2: Dallas?
A2: ...


Here the second question, though not a complete sentence, is understood to be a shorthand
for “What airlines fly to Dallas?” Our discourse module was enhanced to handle such
ellipses.
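
One simple way to picture noun-phrase ellipsis resolution is substitution by semantic class: the elliptical fragment replaces the constituent of the previous query that has the same class. The sketch below illustrates that idea only. The dictionary encoding of the query, the class labels, and the first-match policy are assumptions, and the source does not describe the actual algorithm at this level of detail.

```python
def resolve_ellipsis(previous_query, fragment):
    """Substitute the fragment for the previous filler of the same semantic
    class; return None if no constituent of that class exists."""
    fragment_class, fragment_value = fragment
    for slot, (slot_class, _) in previous_query.items():
        if slot_class == fragment_class:
            resolved = dict(previous_query)  # leave the stored query intact
            resolved[slot] = (fragment_class, fragment_value)
            return resolved
    return None


# Q1: "What airlines fly to Washington?"   Q2: "Dallas?"
q1 = {"display": ("AIRLINE", "what airlines"), "destination": ("CITY", "Washington")}
q2 = resolve_ellipsis(q1, ("CITY", "Dallas"))
```

Here `q2` keeps the display request from Q1 and replaces only the destination. A query with two constituents of the same class (for example both an origin and a destination city) would need additional evidence to choose between them, which this sketch does not attempt.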

We also added a capability for handling definite references. Definite references are
Noun Phrases such as “this flight” and “the fares” that are intended to refer to a specific
entity or a group of entities. Our system uses the semantic class information present in the
Noun Phrase to search for an entity in the preceding discourse that the definite reference
refers to. In the case of a definite reference that contains an open slot to be filled (i.e. the
head noun is a relational noun; see Section 3.6.1), such as “the cost”, the system looks
for an entity in the preceding discourse that can fill the slot.
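
The two lookups just described can be sketched as searches over a list of discourse entities. Choosing the most recently introduced matching entity is a simplifying assumption made here for illustration (the module described in this section tracks focus with centering algorithms), and the entity encoding, the `RELATIONAL_NOUNS` table, and the function names are hypothetical.

```python
# Hypothetical table: classes of entity that can fill a relational noun's slot.
RELATIONAL_NOUNS = {"COST": {"FLIGHT", "GROUND-TRANSPORTATION"}}


def resolve_definite(semantic_class, discourse_entities):
    """Return the most recent entity whose class matches the noun phrase."""
    for entity in reversed(discourse_entities):
        if entity["class"] == semantic_class:
            return entity
    return None


def resolve_relational(noun, discourse_entities):
    """Return the most recent entity that can fill the relational noun's slot."""
    allowed = RELATIONAL_NOUNS.get(noun, set())
    for entity in reversed(discourse_entities):
        if entity["class"] in allowed:
            return entity
    return None
```

For “this flight”, `resolve_definite("FLIGHT", entities)` returns the latest FLIGHT entity; for “the cost”, `resolve_relational("COST", entities)` returns the latest entity that something can be the cost of.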


5.2.1  Modifications  for  ATIS 


When the ATIS domain was chosen for the purposes of cross-site evaluation, we began
analyzing the discourse phenomena that occur in that domain. Although the general-purpose
discourse module of DELPHI already covered most instances of explicit reference, it needed
to be extended to obtain good coverage in ATIS.

Previous work on discourse context has focused on recovering meaning from context in
the face of explicit linguistic cues in an utterance. Such cues include pronouns and definite
references (e.g. “Which of them stop in Dallas”, “Do any of those flights serve breakfast”),
as well as clearly elliptical utterances (e.g. “on Sunday”, “morning flights”, “round trip”).
The DELPHI system includes a general domain-independent mechanism for dealing with
such phenomena.


As we saw in Section 5.1.1, an investigation of the transcripts for ATIS showed that
the context dependence in this task was not always indicated by such explicit linguistic
cues. Rather, a travel plan is gradually built up by the speaker, and the information already
given for the plan is assumed and not repeated nor referred to explicitly by the user.

Since the user of an ATIS system is typically planning a flight, which is a “constraint
satisfaction” type of task, a standard strategy is for the user to set up a discourse context that
specifies some constraints, such as “I need to fly from Boston to Dallas on Monday.” Later
interactions assume that all such constraints are maintained unless explicitly or implicitly
modified or removed. Thus, even though a sentence like “Are there any nonstop flights?”
looks on the surface like it is interpretable independent of the discourse context (cf. a
sentence from the personnel database domain “Are there any Cambridge residents?”), the
“flights” it refers to must satisfy the earlier constraints. In general, any reference to “flight”
or “fare” in ATIS, whether a definite reference (“the flights”), a deictic reference (“that
flight”, “those flights”) or an indefinite reference (“any nonstop flights”) is to be assumed
to meet the constraints established earlier in the discourse.

For  example,  in  the  following  sequence: 


List  the  flights  from  San  Francisco  to  Atlanta,  please. 
Show  me  flights  that  arrive  after  noon. 


the second sentence assumes that “flights” is already constrained to be those from San
Francisco to Atlanta.

We found that of the 98 discourse pairs supplied by TI for training on the ATIS1 corpus,
only 14% contained explicit reference. To increase our coverage, we began implementing
an explicit representation of the “flight scenario”, where sentences having to do with a flight
result in the addition of constraints to the scenario. Subsequent sentences inherit those
constraints. The simplest form of this approach addressed almost half of the discourse
pairs.

To do this, we implemented a new module of DELPHI, the Task Tracker, which
maintained a model of the user’s task context as it evolves in the discourse of a dialogue. It
takes as input the semantic interpretation of the query, and delivers as output a version
of that semantic interpretation making explicit information that can be inferred from the
Tracker’s evolving knowledge of the task the user wishes to accomplish. This task tracker
is intended to solve a problem similar to the one solved by the “templates” of other systems,
but is more general in a number of ways.


Again, consider the following dialogue:

Q1: Show me flights from Dallas to Denver please.
Q2:  Show  me  just  the  flights  that  arrive  after  noon. 


the  user’s  real  intended  meaning  for  Q2  is: 


Q2’: Show me flights from Dallas to Denver that arrive after noon.


and not simply any flight that happens to arrive after noon, no matter what its origin and
destination. The Tracker infers this crucial information and adds it to the interpretation of
the second query so that the correct retrieval can be made from the database. It does so
by maintaining a record (the “task context”) derived from the user’s previous queries, and
merging the stored information with the new information in the present query to find the
query’s intended meaning. Task contexts are frame-like, having slots for the various pieces
of component information. For example, the task context for taking a flight has slots for
“origin”, “destination”, “airline” and so forth.

To fill the slots of the flight scenario, the “task tracker” extracts constraints from the
same general semantic representation that we use to represent the meaning of each utterance.
The constraints are themselves represented in the same semantic representation. In the
San Francisco to Atlanta example given earlier, we need to extract the representation for
“San Francisco” and “Atlanta” to fill the “origin” and “destination” slots respectively.
However, in general the filler of a slot may be a complicated expression, including other
quantifiers, for example, as in “Show me flights from the airport in Boston”. Thus, there is
no restriction that the constraints be of a particular simple form, such as specific constant
“fillers” for “slots” in a template (e.g. “Dallas” is the filler of the DESTINATION slot). We
can represent constraints involving negation (“flights that do not stop in Dallas”),
disjunction (“flights that stop in Dallas or Denver”), as well as more complex descriptions
such as “the airport in Boston”.
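
The merge behaviour of a task context can be sketched as follows. Slot values are treated as opaque objects, so they could equally be constants or arbitrary constraint expressions. The class and method names are hypothetical, and the reset rule (restating both origin and destination starts a new plan) is an assumption based on the observation in Section 5.1.1 that previous constraints are likely to be dropped in that case.

```python
class TaskContext:
    """Frame-like record of the constraints accumulated for the current plan."""

    def __init__(self):
        self.slots = {}

    def merge(self, new_constraints):
        """Merge the constraints of the present query into the stored context
        and return the full set of constraints that the query should carry."""
        # Assumed reset rule: an explicit origin and destination begin a new plan.
        if new_constraints.keys() >= {"origin", "destination"}:
            self.slots = {}
        self.slots.update(new_constraints)
        return dict(self.slots)


tracker = TaskContext()
tracker.merge({"origin": "Dallas", "destination": "Denver"})   # Q1
q2_meaning = tracker.merge({"arrival-time": ("after", "noon")})  # Q2
```

After the second call, `q2_meaning` carries the origin and destination from Q1 together with the new arrival-time constraint, which corresponds to the expanded reading Q2’ above.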

The original version of the Task Tracker made use of the work performed by the
Discourse Module in generating discourse entities. The Task Tracker obtained the information
it needed from the entities, thus minimizing the need to do manipulation of semantic
representations. When the semantic representation changed, manipulation of semantic
representations became much easier, but, at the same time, the discourse entity generator
became unavailable, so the Task Tracker took over the job of finding the slot fillers for the
scenarios.

A crucial advantage of our approach to task tracking is its modularity. The flight
booking task of ATIS is in some ways a tightly constrained one, involving only a few
operations such as taking a flight from one city to another, and finding a fare on a flight.
By making the encoding of these constraints the responsibility of a separate module, the
Task Tracker, we can take advantage of the constraints without being bound by them in the
design of the other linguistic components of our system. In addition, we built the tracker
in a modular fashion, so that we could readily add new types of constraints to the tracker,
and represent how the specification of a constraint could affect preexisting constraints.

The task tracker is also used in the process of interpreting single “utterances” that
appear to be composed of multiple sentences, such as

(4i00blsx) I’m sorry, I wanted to fly TWA; is there a flight between Oakland and
Boston with a stopover in Dallas/Fort Worth on TWA?

or

(4b0061sx) Find a flight between Denver and Oakland;
the flight should leave in the afternoon and arrive near five PM;
the flight should also be nonstop.


The  parser  and  semantic  interpreter  break  these  utterances  into  multiple  sentences,  and 
the  system  treats  them  as  a  “mini-discourse”,  without  generating  a  visible  answer  to  the 
intermediate  sentences. 

Another module added to the system is a form of reference resolution that allows
the user to refer to entities that appear in the answers given by the system. These are
sometimes called “exaphoric references”, to distinguish them from references to discourse
entities that are introduced by linguistic structures in previous utterances. This is a very
general phenomenon, and should in principle be taken to include such things as deictic
references (e.g. “<those, these> flights”, “<this, that> fare”, “this” and “that”). In the
current DELPHI system, however, we treat such deictic references as variants on definite
references (e.g. “the flight”) and rely on the task tracker to carry over the constraints
necessary to specify the intended referent. There is, nevertheless, a type of exaphoric
reference that is specific to the ATIS domain and which we treat separately: flight id
resolution.

Flight id resolution is the process of determining the intended reference of expressions
like “flight one three two” or “American one oh one”. While these expressions specify
properties of flights, they are generally intended to refer to specific flights, even though
more than one flight may have the given flight number and airline (the legs of a direct
flight with a stop, for example). Worse, users often refer to a connecting flight by giving
the airline and flight number of its first segment.

This problem is resolved when you realize that in almost all cases, the users have no
idea of flight numbers until they see them on the screen as answers to previous questions.
Thus an expression involving a flight number almost always is intended as a reference to a
specific flight that the system has shown to the user as part of the answer to a previous query.
A general solution would involve the system keeping track of the answers it has presented
(the displays it has made in the case of a graphic system), and having a mechanism to find
those elements of the display which the user might have perceived as satisfying a given
description. This is in general a very difficult problem, since it in principle requires having
a model of how the user perceives displays and describes portions of those displays. In the
ATIS domain there is a simple and quite effective variation on this technique. DELPHI
keeps track of the flights appearing in answers to previous queries, and has a mechanism
to find the most recently displayed flight that satisfies the given flight id description. Every
displayed flight has either one or two airline id/flight number pairs (direct flights have one
pair, connections have two pairs separated by a slash). It is simple to map typical linguistic
descriptions of these pairs to the pairs themselves. This mechanism as such is quite specific
to the ATIS domain, but the notion of exaphoric reference resolution is quite general.
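
The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DELPHI implementation: the names `DisplayedFlight`, `FlightDisplayHistory`, `record_answer` and `resolve` are hypothetical, and matching a description against either leg of a connection is an assumed policy.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DisplayedFlight:
    # One (airline, flight number) pair for a direct flight,
    # two pairs for a connection. (Hypothetical representation.)
    legs: Tuple[Tuple[str, int], ...]


class FlightDisplayHistory:
    """Remembers flights shown in earlier answers, oldest first."""

    def __init__(self) -> None:
        self._shown: List[DisplayedFlight] = []

    def record_answer(self, flights: List[DisplayedFlight]) -> None:
        # Called whenever an answer containing flights is displayed.
        self._shown.extend(flights)

    def resolve(self, flight_number: int,
                airline: Optional[str] = None) -> Optional[DisplayedFlight]:
        # Scan from the most recently displayed flight backwards and return
        # the first one with a leg matching the description. The airline is
        # optional because users may say only "flight one three two".
        for flight in reversed(self._shown):
            for leg_airline, leg_number in flight.legs:
                if leg_number == flight_number and airline in (None, leg_airline):
                    return flight
        return None
```

A description such as “American one oh one” would first be mapped to the pair `("AA", 101)` and then passed to `resolve`; a result of `None` signals that no displayed flight matches.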


Chapter  6 


Integration of Speech Recognition with NL


Many methods have been proposed for the integration of speech recognition and natural
language understanding. Ideally, we would like the natural language understanding
component of a spoken language understanding system to aid in the speech recognition process by
providing powerful constraints on the allowable sentences. However, since the language
models used for natural language understanding are usually quite complex, the tightly
coupled integration of these two components usually results in very large computation.

For example, a first-order statistical language model can reduce perplexity by at least a
factor of 10 with little extra computation over using no grammar, while applying complete
natural language (NL) models of syntax and semantics to all partial hypotheses typically
requires much more computation for less perplexity reduction. (Murveit [69] has shown that
the use of an efficiently implemented syntax component within a recognition search actually
slowed down the search unless it was used very sparingly.) In addition to reducing total
computation, the resulting systems are more modular when we separate radically different
knowledge sources (KSs).

Instead of trying to use natural language as a tightly coupled constraint on the speech
recognition, we have developed the N-Best Paradigm for the integration of multiple
knowledge sources. The basic idea is to use a subset of the most efficient knowledge sources to
find not one, but a list of all of the likely sentence hypotheses. These hypotheses are then
filtered and rescored using the remaining knowledge sources. For the problem of natural
language understanding, this means that the NL component need only process the text of
the most likely word sequences. Thus, the time required is reduced by several orders of
magnitude. We first introduced the basic idea for this paradigm and an exact algorithm
for finding the N-Best sentence hypotheses at the Cape Cod DARPA Spoken Language
Workshop in October of 1989.

In the sections that follow we describe the issues and the basic algorithm in more
detail. The first section discusses the general issues of integration strategies, and
introduces the N-Best Paradigm. It also includes the first exact algorithm for finding the
N-Best sentence hypotheses. The second section presents two additional algorithms for finding
the N-Best hypotheses. These two algorithms are approximations that require much less
computation than the exact algorithm. The accuracy and speed of the three algorithms are
compared. The third section presents several new uses that we have found for the N-Best
hypotheses. In the fourth section, we describe how we have used the N-Best algorithm as
part of our integration strategy of speech with natural language processing.


6.1  The  N-Best  Paradigm  and  an  Exact  N-Best  Algorithm 


The N-Best algorithm is a time-synchronous Viterbi-style beam search procedure that is
guaranteed to find the N most likely whole sentence alternatives that are within a given
“beam” of the most likely sentence. The computation is linear with the length of the
utterance, and also linear in N. When used together with a first-order statistical grammar,
the correct sentence is usually within the first few sentence choices. The output of the
algorithm, which is an ordered set of sentence hypotheses with acoustic and language
model scores, can easily be processed by natural language knowledge sources without the
huge expansion of the search space that would be needed to include all possible knowledge
sources in a top-down search.


6.1.1  Introduction 

In a spoken language system (SLS) we have a large search problem. We must find the
most likely word sequence consistent with all knowledge sources (speech, statistical
N-gram, natural language). The natural language (NL) knowledge sources are many and
varied, and might include syntax, semantics, discourse, pragmatics, and prosodics. One
way to use all of these constraints is to perform a top-down tightly-coupled search that, at
each point, uses all of the knowledge sources (KSs) to determine which words can come
next, and with what probabilities. Assuming an exhaustive search in this space, we can find
the most likely sentence. However, since many of these KSs contain “long-distance” effects
(for example, agreement between words that are far apart in the input), the search space can
be quite large, even when pruned using various beam-search or best-first search techniques.
Furthermore, a top-down search strategy requires that all of the KSs be formulated in a
predictive, left-to-right manner. This may place an unnecessary restriction on the type of
knowledge that can be used.

The general solution that we have adopted is to apply the KSs in the proper order
to constrain the search progressively. Thus, we trade off the entropy reduction that a KS
provides against the cost of applying that KS. Naturally, we can also use a pruning strategy
to reduce the search space further. By ordering the various KSs, we attempt to minimize
the computational costs and complexity for a given level of search error rate. To do this we
apply the most powerful and cheapest KSs first to generate the top N hypotheses. Then,
these hypotheses are evaluated using the remaining KSs. In the remainder of this section we
present the N-best search paradigm, followed by the N-best decoding algorithm. We give
an outline of the proof that the algorithm does, in fact, result in the correct list of sentence
hypotheses. Finally, we present statistics of the rank of the correct sentence in a list of the
top N sentences using acoustic-phonetic models and a statistical language model.


6.1.2  The  N-best  Search  Paradigm 

Figure  6.1  illustrates  the  general  N-best  search  paradigm.  We  order  the  various  KSs  in 
terms  of  their  relative  power  and  cost.  Those  that  provide  more  constraint,  at  a  lesser  cost, 
are  used  first  in  the  N-best  search.  The  output  of  this  search  is  a  list  of  the  most  likely 
whole  sentence  hypotheses,  along  with  their  scores.  These  hypotheses  are  then  rescored 
(or  filtered)  by  the  remaining  KSs. 
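
The second stage of this paradigm can be sketched as follows. This is a minimal illustration, not the HARC code: the function name `rescore_n_best`, the convention that a knowledge source returns a log-score or `None` to reject a hypothesis, and the toy knowledge sources and scores in the usage note are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A knowledge source maps a word sequence to an additive log-score,
# or to None when it rejects the hypothesis outright (a binary filter).
KnowledgeSource = Callable[[Sequence[str]], Optional[float]]


def rescore_n_best(
    hypotheses: List[Tuple[List[str], float]],
    knowledge_sources: List[KnowledgeSource],
) -> List[Tuple[float, List[str]]]:
    """Filter and reorder first-stage hypotheses with the remaining KSs.

    `hypotheses` holds (words, first_stage_log_score) pairs from the
    N-best search. The result is sorted by combined score, best first.
    """
    rescored = []
    for words, first_stage_score in hypotheses:
        total = first_stage_score
        for ks in knowledge_sources:
            ks_score = ks(words)
            if ks_score is None:  # rejected: drop this hypothesis
                break
            total += ks_score
        else:
            rescored.append((total, words))
    rescored.sort(key=lambda item: item[0], reverse=True)
    return rescored
```

For instance, a toy syntactic filter that rejects any hypothesis containing “which” and a toy semantic score that penalizes “five” as a resolution value would move a lower-ranked hypothesis such as “set chart switch resolution to high” to the top of the list.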

Depending on the amount of computation required, we might include more or fewer
KSs in the initial N-best search. For example, it is quite inexpensive to search using a
first-order statistical language model, since the number of acoustic and language states is
small. Frequently, a syntactic model of NL will be quite large, so it might be reserved
until after the list generation. Given a list of hypothesized sentences, each alternative can
usually be parsed in turn in a fraction of a second. If the syntax is small enough, it can be
included in the initial N-best search, to further reduce the list that would be presented to
the remainder of the KSs. We can also use this paradigm in conjunction with high-order
statistical language models. While a high-order model frequently provides added power
(over a first-order model), the added power may not be commensurate with the large amount
of extra computation and storage needed for the search. In this case, a first-order language
model can be used to reduce the choice to a small number of alternatives which can then
be reordered using the higher-order model.

Besides the obvious computational and storage advantages, there are several other
practical advantages of this paradigm. Since the output of the first stage is a small amount
of text, and there is no further processing required from the acoustic recognition
component, the interface between the speech recognition and the other KSs is trivially simple,
while still optimal. As such, this paradigm provides a most convenient mechanism for
integrating work in a modular way. This high degree of modularity means that the different
component subsystems can be optimized and even implemented separately (both hardware
and software). For example, the speech recognition might run on a special-purpose array
processor-like machine, while the NL might run on a general purpose host.

Figure 6.1: The N-best Search Paradigm. The most efficient knowledge sources, KS1, are
used to find the N Best sentences. Then the remaining knowledge sources, KS2, are used
to reorder the sentences and pick the most likely one.


6.1.3 The N-Best Decoding Algorithm


The optimal N-Best decoding algorithm is, in spirit, quite similar to the time-synchronous
Viterbi decoder that is used quite commonly. However, it differs in what it must compute
and in its implementation. It must compute probabilities of word-sequences rather than
state-sequences, and it must find all such sequences within the specified beam. The basic
idea is to keep separate records for theories with different word sequence histories. Each
path is marked with an identifier that represents the complete sequence of words up to this
point (the history). When two or more paths come to the same state at the same time,
we check whether there is already an existing path at that state with the same history. If
there is, we add the probability for the two paths. Otherwise, we create a new path. When
all paths for a state have been created, we reduce the number of paths by keeping up to
a specified maximum number N of theories whose probabilities are within a threshold of
the probability of the most likely word sequence at that state. Note that this state-dependent
threshold is distinct from and smaller than the global beam search threshold.
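
The per-state bookkeeping described above can be sketched as follows. This is an illustrative fragment under stated assumptions rather than the BYBLOS implementation: the function name `merge_theories`, the representation of a history as a tuple of words, and the expression of the state beam as a probability ratio are choices made for the example, and the sort is used only for brevity.

```python
from typing import Dict, Iterable, Tuple

History = Tuple[str, ...]


def merge_theories(
    incoming: Iterable[Tuple[History, float]],
    max_theories: int,
    state_beam: float,
) -> Dict[History, float]:
    """Combine the paths arriving at one state in one frame.

    Paths with the same word history have their probabilities added
    (total likelihood rather than Viterbi maximum). Afterwards, at most
    `max_theories` histories are kept, and only those whose probability
    is at least `state_beam` times the best probability at this state.
    """
    merged: Dict[History, float] = {}
    for history, prob in incoming:
        merged[history] = merged.get(history, 0.0) + prob
    if not merged:
        return {}
    threshold = max(merged.values()) * state_beam
    survivors = [(p, h) for h, p in merged.items() if p >= threshold]
    survivors.sort(reverse=True)  # best probability first
    return {h: p for p, h in survivors[:max_theories]}
```

Replacing the addition with `max` would give the Viterbi-style scoring that the exact algorithm is contrasted with below.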

Since  probabilities  for  different  word  sequences  are  kept  distinct,  it  is  easy  to  see  that 
any  word  sequence  hypothesis  that  reaches  the  end  of  the  sentence  has  an  accurate  score. 
This  score  is  the  conditional  probability  of  the  observed  acoustic  sequence  given  this  word 
sequence.  Of  course,  since  the  number  of  possible  word  sequences  grows  exponentially,  we 
must  use  a  pruning  algorithm  to  reduce  it  to  the  desired  number.  The  interesting  question 
is  whether  one  can  prove  that  all  of  the  word  sequences  with  probabilities  greater  than  the 
threshold  will  end  up  in  the  list  with  the  correct  scores. 


Algorithm  Optimality 

There have been two recent papers that deal with the topic of finding more than one
answer for the whole sentence [58, 95]. However, both of these papers are based on the
Viterbi algorithm. That is, when two paths for the same word sequence come to the same
state, the probability is computed as the maximum of the two paths rather than the sum.
Thus these algorithms find the most likely sequence of states rather than the most likely
sequence of words. More importantly though, the alternative answers are constrained by
the segmentation and traceback of the most likely answer. Since the segmentation of the
sentence into words often depends on the words chosen, the answers found in this way are
not, in fact, the best N answers. In fact, we have found in the past that this approximation
is quite severe. In [95], the exact algorithm for the word sequences corresponding to the
best state sequences is mentioned, but is not used, due to the computational requirements.
The results given in [95] using a statistical bigram grammar of perplexity 124 show that
approximately one third of the sentences that are not recognized correctly on the first choice
have the correct answer within the top 10 choices found by the approximate algorithm. As
will be seen in the next section, with the exact algorithm used here, for a similar statistical
grammar, about 90% of the sentences that are not recognized correctly on the first choice
have the correct answer within the top 10 choices,
and about 97% are within the top 24 choices. It should be mentioned that these tests have
been performed on different speech corpora, with different acoustic and language models,
making direct comparisons difficult.

It should be clear that the algorithm used here would result in the exact solution for
all of the possible answers for a given utterance. It is harder to see that the algorithm that
finds the N-Best answers within a threshold of the best answer in fact does so. The proof
(which is not included here in its entirety) relies on the fact that the beamwidth at each
state is very large, typically on the order of 10*^. Possible errors could occur when we
should be adding two paths for the same word sequence together, but one or both of them
is ignored because its score is more than 10'^ below the best score at the state. However,
if the larger of the two path probabilities was much above the threshold, say 10 times the
threshold (still 10’^ below the best score), then the error due to ignoring the lower score is
insignificant. If both are below the threshold, then when added, they can at most be twice
the threshold, which is still quite low. Even if this happened in every frame of an utterance
(an extremely unlikely event), the effect on the score would be small compared to the state
beamwidth.

The  result  is  that  the  algorithm  will  correctly  detect  and  score  all  theories  that  are  above 
the  threshold  by  one  order  of  magnitude.  However,  the  score  of  theories  that  are  within  the 
last  order  of  magnitude  of  the  final  beam  may  be  slightly  underestimated.  This  means  that 
the  state  beamwidth  should  be  one  order  of  magnitude  larger  than  needed,  and  the  theories 
within the last order of magnitude can be ignored. When a hard limit of N is placed on
the theories at each state, the effective beamwidth at that state could decrease. In this case,
we must again include any theories that are within one order of magnitude below the Nth
theory at the state to ensure that the final result is correct.


Implementation 


This  algorithm  requires  (at  least)  N  times  the  memory  for  each  state  of  the  hidden  Markov 
model.  However,  this  memory  is  typically  much  smaller  than  the  amount  of  memory 
needed  to  represent  all  the  different  acoustic  models.  We  assume  here,  that  the  overall 
“beam”  of  the  search  is  much  larger  than  the  “beam  at  each  state”  to  avoid  pruning  errors. 
In  fact,  for  the  first-order  grammar,  it  is  even  reasonable  to  have  an  infinite  beam,  since 
the  number  of  states  is  determined  only  by  the  vocabulary  size. 

At first glance, one might expect that the cost of combining several sets of N theories
(from preceding states) into one set of N theories at a state might require computation on
the order of N^2. However, we have devised a “grow and prune” strategy that avoids this
problem. At each state, we simply gather all of the incoming theories. At any instant,
we know the best scoring theory coming to this state at this time. From this, we compute
a pruning threshold for the state. This is used to discard any theories that are below the
threshold. At the end of the frame (or if the number of theories gets much too large), we
reduce the number of theories using a prune and count strategy that requires no sorting.
While this would theoretically still require computation on the order of N, it only accounts
for a part of the total computation. We find, empirically, that the overall computation
increases with √N, or slower than linear. This makes it practical to use somewhat high
values of N in the search.
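
The report does not spell out the prune and count step. One way to reduce a set of theories to roughly N without sorting is to count scores in coarse bins below the best score and pick the cutoff from the counts, which takes a single linear pass. The sketch below shows that idea only; the function name `prune_and_count_cutoff`, the fixed number of bins, and the use of log scores are assumptions, and the actual strategy used in the system may differ.

```python
from typing import List


def prune_and_count_cutoff(
    log_scores: List[float],
    max_theories: int,
    beam: float,
    num_bins: int = 32,
) -> float:
    """Return a cutoff such that theories with score > cutoff are kept.

    Scores within `beam` of the best are counted in `num_bins` equal-width
    bins (bin 0 holds the best scores). Bins are accepted from the top
    until adding another bin would exceed `max_theories`. No sorting is
    needed, so the cost is linear in the number of theories.
    """
    best = max(log_scores)
    bin_width = beam / num_bins
    counts = [0] * num_bins
    for score in log_scores:
        gap = best - score
        if gap < beam:
            # min() guards against floating-point rounding at the edge.
            counts[min(int(gap / bin_width), num_bins - 1)] += 1
    kept = 0
    for index, count in enumerate(counts):
        if kept > 0 and kept + count > max_theories:
            return best - index * bin_width
        kept += count
    return best - beam
```

Because whole bins are accepted or rejected, the number of survivors is only approximately bounded by `max_theories`; finer bins tighten the bound at the cost of a larger count array.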


6.1.4  Rank  of  the  Correct  Answer 


Whether the N-best search is practical depends directly on whether we can assure that the
correct answer is found reliably within the list that is created by the first stage. (Actually,
if all the remaining KSs have binary scores, that is they either accept or reject a sentence,
then the search is sufficient as long as there is one answer that is acceptable, since the
system could never choose the lower scoring correct answer in this case.) It is possible
that when the correct answer is not the top choice, it might be quite far down the list,
since there could be exponentially many other alternatives that score between the highest
scoring answer and the correct answer. Whether this is true depends on the power of
the acoustic-phonetic models and the statistical language model used in the N-best search.
Therefore we have accumulated statistics of the rank of the correct sentence in the list
of N answers for two different language models: a first-order statistical class grammar
(perplexity 100) [35], and no grammar (perplexity 1000). The first-order class grammar
constrains the probabilities of all words in the same class to be the same, and therefore
can be estimated from a small amount of training data. The experiment was performed on
the speaker-dependent portion of the DARPA 1000-Word Resource Management speech
corpus [76], using the BBN BYBLOS Continuous Speech Recognition System [31]. The
test includes a total of 215 sentences from 12 speakers.

Figure 6.2 plots the cumulative distribution of the rank for the two different language
models. The distribution is plotted for N up to 100. We have also marked the
average rank on the distribution. The average rank of the correct answer was 9.3 for no
grammar, and the correct answer is not on the list at all about 20% of the time. However,
when we use the statistical class grammar, which is a fairly weak grammar for this domain,
we find that the average rank is 1.8, since most of the time, the correct answer is within the
first few choices. In fact, for this test of 215 sentences, 70% of the sentences were correct
on the first choice, while 99% of the sentences were found within the 24 top choices. It is
also noteworthy that the acoustic model used in this experiment is an earlier version (that
does not model coarticulation between words or use smoothing of poorly trained models)
that results in twice the word error rate of the most recent models. This means that the
likelihood that the correct answer will be found within a short list of sentences could be
even higher than shown here when the better acoustic models are used.
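
Statistics of this kind are straightforward to compute from the per-sentence rank of the correct answer. The sketch below is illustrative only: the function name `rank_statistics` is hypothetical, and averaging the rank over only those sentences whose correct answer appears in the list is an assumption, since the report does not state how missing answers enter the average.

```python
from typing import List, Optional, Tuple


def rank_statistics(
    ranks: List[Optional[int]], list_size: int
) -> Tuple[List[float], Optional[float]]:
    """Cumulative rank distribution and average rank of the correct answer.

    `ranks` holds, per test sentence, the 1-based position of the correct
    sentence in the N-best list, or None when it is not in the list.
    cumulative[n - 1] is the fraction of all sentences whose correct
    answer is within the top n choices.
    """
    total = len(ranks)
    found = [r for r in ranks if r is not None]
    cumulative = [
        sum(1 for r in found if r <= n) / total
        for n in range(1, list_size + 1)
    ]
    average_rank = sum(found) / len(found) if found else None
    return cumulative, average_rank
```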

To illustrate the types of lists that are generated we show below a sample N-best output.
In this example, the correct answer is the fifth one on the list.


Figure 6.2: Cumulative Distribution of Rank of Correct Sentence. For the statistical class
grammar, 99% of the sentences were recognized exactly within the top 24 choices.


Example of N-best Output

Answer:

Set chart switch resolution to high.


Top N Choices:

Set  charts  which  resolution  to  five. 

Set  charts  which  resolution  to  high. 

Set  charts  which  resolution  to  on. 

Set chart switch resolution to five.

Set chart switch resolution to high.


6.2 More Efficient N-Best Algorithms


Since  we  introduced  the  N-Best  Paradigm,  we  have  invented  several  different  algorithms 
for  finding  the  N-Best  sentence  hypotheses.  The  Sentence-Dependent  N-Best  algorithm  is 
an  exact  procedure  for  finding  all  of  the  sentence  hypotheses  whose  total  score  is  within 
a  threshold  of  the  most  likely  sentence.  The  computation  required  is  linear  in  the  number 
of hypotheses found. We have also developed two much faster approximate algorithms:
the Word-Dependent N-Best Algorithm, and the Lattice N-Best Algorithm. The Lattice
algorithm requires no more time than the usual 1-Best search, but is not as accurate as
the Sentence-Dependent Algorithm. However, the Word-Dependent algorithm requires
computation that is about 10 times that of the 1-Best algorithm, but is empirically as
accurate as the exact method. The three algorithms and their speed and accuracy are
compared in more detail below.

The  Word-Dependent  algorithm  is  based  on  the  assumption  that  the  beginning  time 
of  a  word  depends  only  on  the  preceding  word.  We  compare  this  algorithm  with  two 
other algorithms for finding the N-Best hypotheses: the exact Sentence-Dependent method
reported  in  [32]  and  a  computationally  efficient  Lattice  N-Best  method.  We  show  that, 
while  the  Word-Dependent  algorithm  is  computationally  much  less  expensive  than  the 
exact  algorithm,  it  appears  to  result  in  the  same  accuracy.  However,  the  Lattice  method, 
which  is  still  more  efficient,  has  a  significantly  higher  error  rate. 

We also have demonstrated that algorithms that use Viterbi scoring (i.e. they find the
word sequences with the most likely single state sequence) have significantly higher error
rates than those that use total likelihood scoring (summed over all state sequences) (e.g.
[95, 91]).

The Exact Sentence-Dependent algorithm presented in Figure 6.1 is unique in that it
provides the correct forward probability score for each hypothesis found. The basic idea of
the algorithm is that, if two or more theories at a state involve identical sequences of words,
we add their scores, since we want the likelihood of the sequence of words summed over
all state sequences. Otherwise we keep an independent score for each different preceding
sequence of words. We preserve all different theories at each state, as long as they are
above the global pruning threshold and within the state beamwidth of the best score at
the same state. This algorithm guarantees finding all hypotheses within a threshold of the
best hypothesis. While the proof is not given here, it is easy to show that the inaccuracy
in the scores computed is bounded by the product of the sentence length and the pruning
beamwidth, which is typically a very small fraction of the score. While the number of
theories that are within a threshold of the best theory at a state could theoretically grow
exponentially with time (if all theories had about the same score), we find empirically that
the number of theories within a threshold remains fairly constant and small. The algorithm
was optimized to avoid expensive sorting operations so that it required computation that
was less than linear with the number of sentence hypotheses found. In the remainder of the
section, we will refer to this particular algorithm as the Exact or the Sentence-Dependent
algorithm.

There is a practical problem associated with the use of this exact algorithm. In cases
where we need a large number of hypotheses, the computation, which is almost linear
in N, becomes excessive. Frequently we compute up to 100 hypotheses for a sentence.
In addition, when we examine the different answers found, we notice that many of the
different answers are simple one-word variations of each other. Clearly, longer sentences
would be expected to have more variations, and thus require a large value of N. Thus,
much of the computation is spent finding all combinations of independent variations. In
the next section, we present two algorithms that attempt to avoid these problems.


6.2.1  Two  Approximate  N-Best  Algorithms 


While  the  exact  N-Best  algorithm  is  theoretically  interesting,  we  can  generate  lists  of 
sentences  with  much  less  computation  if  we  are  willing  to  allow  for  some  approximations. 
(It  is  important  to  note  that,  as  long  as  the  correct  sentence  can  be  guaranteed  to  be  within 
the list, the list can always be reordered by rescoring each hypothesis individually at the
end.)  We  present  two  such  approximate  algorithms  with  reduced  computation. 


Lattice N-Best


The first algorithm will derive an approximate list of the N Best sentences with little more
computation than the usual 1-Best search. Within words we use the time-synchronous
forward-pass search algorithm [86], with only one theory at each state. We add the
probabilities of all paths that come to each state. At each grammar node (for each frame), instead
of remembering only the best scoring word, we store all of the different words that arrive
at that node along with their respective scores in a traceback list. This requires no extra
computation above the 1-Best algorithm. The score for the best hypothesis at the grammar
node is passed forward as the basis for future scoring as in the normal time-synchronous
forward-pass search. A pointer to the stored list of words and scores is also sent on. At
the end of the sentence, we simply search (recursively) through the saved traceback lists
for all of the complete sentence hypotheses that are above some threshold below the best
theory.

Figure 6.3 illustrates several alternate sentence hypotheses stored in the traceback. The
locations of alternate word ends are indicated by filled circles. The resulting alternate
sentences are shown on the left, ordered by score. Of course the number of sentences
represented in this traceback lattice is huge. One algorithm for finding the most likely
sentences in the traceback is presented below:

Figure 6.3: N-Best Theory Traceback. The log-score differences from the best sentence
are added to compute the total sentence scores for each alternative.


1.  Initialize  (clear)  stack  of  alternate  choices. 

2. Initialize accumulated score decrement, a, to 0. Initialize word position, t, to the end
of the sentence.

3. Perform traceback computation from t with score a, chaining back through word-
ends to produce the next highest scoring sentence. Add this sentence to the list of
N-Best sentence hypotheses.

4. At each word boundary along the traceback, add the accumulated score decrement,
a, to each of the (negative log) differences between each alternate word score and the
current sentence hypothesis. Add these differences to the stack of alternate choices.

5. Pick the alternate with the smallest difference from the best sentence.

6. Perform steps 3 and 4 (recursively) from this point in the sentence until the desired
number of hypotheses has been found, or any remaining hypotheses would score too
far below the best hypothesis to be of interest.
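
The six steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the report's implementation: the traceback is assumed to be a dictionary mapping each word-end boundary to a list of (word, start boundary, score difference) tuples, boundary 0 is assumed to be the start of the sentence, and the names `lattice_nbest`, `max_decrement` and the toy lattice are hypothetical. The sketch does not merge duplicate word sequences that differ only in their boundaries.

```python
import heapq


def lattice_nbest(traceback, end, n, max_decrement=float("inf")):
    """Extract up to n sentences from a traceback lattice, best first.

    traceback[t] lists the words ending at boundary t as tuples
    (word, start_boundary, diff), where diff is the negative-log score
    difference from the best word ending at t (0.0 for the best word).
    Boundary 0 is assumed to be the start of the sentence.
    """
    results = []
    tie = 0  # unique tie-breaker so the heap never compares word tuples
    # Steps 1 and 2: the stack starts with decrement 0 at the sentence end.
    stack = [(0.0, tie, end, ())]
    while stack and len(results) < n:
        # Step 5: pick the alternate with the smallest total decrement.
        a, _, t, suffix = heapq.heappop(stack)
        if a > max_decrement:
            break  # Step 6: everything left scores too far below the best.
        # Step 3: chain back through the best word at each boundary.
        while t != 0:
            entries = sorted(traceback[t], key=lambda e: e[2])
            best_word, best_start, _ = entries[0]
            # Step 4: stack each alternate word with its accumulated decrement.
            for word, start, diff in entries[1:]:
                tie += 1
                heapq.heappush(stack, (a + diff, tie, start, (word,) + suffix))
            suffix = (best_word,) + suffix
            t = best_start
        results.append((a, list(suffix)))
    return results


# Hypothetical toy lattice: boundaries are frame indices.
toy_traceback = {
    30: [("boston", 18, 0.0), ("austin", 20, 1.2)],
    18: [("to", 10, 0.0), ("two", 10, 0.5)],
    20: [("to", 10, 0.0)],
    10: [("fly", 0, 0.0), ("flight", 0, 2.0)],
}
toy_nbest = lattice_nbest(toy_traceback, end=30, n=4)
```

Because following the best word at every boundary adds no further decrement, each sentence's total decrement is known as soon as its alternate is popped, which is why the hypotheses come out in order of score.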


The  alternate  sentence  hypotheses  are  produced  in  order  of  decreasing  total  score.  This 
recursive traceback can be performed very quickly. (We typically extract the 100 best
answers in a small fraction of a second.) We call this algorithm the Lattice N-Best algorithm
since we essentially have a dense word lattice represented by the traceback information.
An important advantage of this algorithm is that it naturally produces more answers for
longer sentences, since the number of permutations of answers grows exponentially with
the  length  of  the  utterance. 

There is, however, a serious problem with the Lattice N-Best algorithm. It systematically
underestimates or completely misses high scoring hypotheses. Figure 6.4 shows an
example in which two different words (words 1 and 2) can each be followed by the same
word (word 3). We assume here that the word sequence 2-3 scores better than 1-3. The
dark lines for words 1 and 2 show the optimal path for each word when followed by word
3. The gray lines show two suboptimal paths for word 1. Since we allow only one theory
at each state within word 3, there is only one best beginning time for word 3, determined
by the best boundary between the best previous word (word 2 in the example) and the
current word. But, as shown in Figure 6.4, the second-best theory, involving a different
previous word (word 1 in the example), would naturally end at a slightly different time.
Thus, the best score for the second-best theory would be severely underestimated or lost
altogether. Consequently, we cannot correctly compute even the second-best hypothesis
without recomputing the likelihood of the different word sequence.


Word-Dependent  N-Best 

As a compromise between the exact sentence-dependent algorithm and the lattice algorithm,
we devised the Word-Dependent N-Best algorithm. The algorithm is illustrated in
Figure 6.5. We reason that, while the best starting time for a word does depend on the
preceding word, it probably does not depend on any word before that. Therefore, we
distinguish theories based on only the previous word rather than on the whole preceding
sequence. At each state within the word, we preserve the total probability for each of
n (<< N) different preceding words. At the end of each word, we record the score for
each previous word hypothesis along with the name of the previous word. Then we proceed
with a single theory with the name of the word that just ended. At the end of the
sentence, we perform the recursive traceback (similar to that described above) to derive
the list of the most likely sentences. As shown in Figure 6.5, both sequences are available
at the end of word 3.

Like the lattice algorithm, the word-dependent algorithm naturally produces more answers
for longer sentences. However, since we keep multiple theories within the word, we
correctly identify the second best path (and all others as well).
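
The per-state bookkeeping of the Word-Dependent algorithm can be sketched as follows. This is only an illustration under assumptions: the names `merge_theories` and `close_word` are hypothetical, theories are assumed to be stored as a dictionary from previous word to probability, and passing the best score forward at the word end is assumed by analogy with the lattice algorithm rather than stated in the text.

```python
def merge_theories(incoming, n):
    """Combine paths arriving at one state, keeping one theory per previous word.

    incoming: iterable of (previous_word, probability) pairs.
    Paths that share a previous word have their probabilities added
    (total-likelihood scoring); only the n best theories are kept.
    """
    merged = {}
    for prev_word, prob in incoming:
        merged[prev_word] = merged.get(prev_word, 0.0) + prob
    best = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(best)


def close_word(word, theories):
    """At a word end, record every (previous word, score) pair for the
    traceback and continue with a single theory named after this word."""
    entries = sorted(theories.items(), key=lambda kv: kv[1], reverse=True)
    outgoing = (word, entries[0][1])  # assumption: best score is passed on
    return entries, outgoing


# Hypothetical example: four paths reach a state inside word 3.
state = merge_theories(
    [("word1", 0.25), ("word2", 0.5), ("word1", 0.125), ("word4", 0.0625)], n=2
)
traceback_entries, outgoing_theory = close_word("word3", state)
```

Keeping the theories separate by previous word is what lets the sequence 1-3 keep its own best boundary instead of inheriting the boundary chosen for 2-3.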

The computation is proportional to n, the number of theories kept locally, which is
typically 3 to 6. The number of local theories only needs to account for the number of
possible previous words, not all possible preceding sequences. Thus, while the computation
needed is greater than for the lattice algorithm, it is far less than for the sentence-dependent
algorithm.

Figure 6.4: Alternate paths in the Lattice algorithm. The best path for words 2-3 overrides
the best path for words 1-3.


6.2.2  Comparison  of  N-Best  Algorithms 


We performed experiments to compare the accuracy of the three N-Best algorithms. In
all cases, we used a weak first-order statistical grammar based on 100 word classes. The
experiments were performed on the speaker-dependent subset of the DARPA 1000-Word
Resource Management Corpus [76]. The test set used was the June '88 speaker-dependent
test set of 300 sentences. The test set perplexity is approximately 100. To enable direct
comparison with previous results, we did not use models of triphones across word
boundaries, and the models were not smoothed. We expect all three algorithms to improve
significantly when the latest acoustic modeling methods are used.

BBN Systems and Technologies

BBN Report No. 7715


Figure 6.5: Alternate paths in the Word-Dependent algorithm. Best path for words 1-3 is
preserved along with path for words 2-3.

Figure 6.6 shows the cumulative distribution of the rank of the correct answer for the
three algorithms. As can be seen, all three algorithms get the sentence correct on the first
choice about 62% of the time. All three cumulative distributions increase substantially with
more choices. However, we observe that the Word-Dependent algorithm yields accuracies
quite close to that of the Exact Sentence-Dependent algorithm, while the Lattice N-Best is
substantially worse. In particular, the sentence error rate of the Lattice N-Best algorithm at
rank 100 (8%) is double that of the Word-Dependent algorithm (4%). Therefore, if we can
afford the computation of the Word-Dependent algorithm, it is clearly preferred.
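
The quantity plotted in such a comparison, the fraction of test sentences whose correct answer appears at or above a given rank, can be computed as in the sketch below. The function name and the convention of using `None` for a correct answer that is missing from the list are assumptions for illustration; the ranks shown are made-up values, not the experimental data.

```python
def cumulative_rank_distribution(ranks, max_rank):
    """Fraction of utterances whose correct sentence is within each rank.

    ranks: one entry per utterance, the 1-based rank of the correct
    sentence in the N-Best list, or None if it is not in the list.
    Returns a list whose element k-1 is the fraction with rank <= k.
    """
    total = len(ranks)
    return [
        sum(1 for r in ranks if r is not None and r <= k) / total
        for k in range(1, max_rank + 1)
    ]


# Hypothetical ranks for eight utterances.
toy_ranks = [1, 1, 3, None, 2, 1, 7, 1]
toy_distribution = cumulative_rank_distribution(toy_ranks, max_rank=3)
```

The sentence error rate at rank k is one minus the corresponding entry, so an 8% versus 4% error at rank 100 corresponds to distributions reaching 92% and 96%.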

We also observe in Figure 6.6 that the Word-Dependent algorithm is actually better
than the Sentence-Dependent algorithm for very high ranks. This is because, in the
Sentence-Dependent algorithm, the score of the correct word sequence fell outside the
pruning beamwidth. However, in the Word-Dependent algorithm, each hypothesis gets
the benefit of the best theory two words back. Therefore, the correct answer was preserved
in the traceback. This is another advantage that both of the approximate algorithms have
over the Sentence-Dependent algorithm.

Figure 6.6: Comparison of the Rank of the Correct Sentence for the Sentence-Dependent,
Word-Dependent, and Lattice N-Best Algorithms.

While the new algorithm presented here is quite efficient, it still increases the computation
needed by a significant factor over the 1-Best search. Therefore, we developed a
technique called the Forward-Backward Search [5]. The algorithm uses a forward Viterbi
search to establish the best score for any path ending with each particular word at each
time. This is followed by a backwards version of the Word-Dependent N-Best algorithm,
in which only those words that scored well in the forward pass are considered. The result is
that the computation of the N-Best list is reduced by a factor of 40. These two algorithms
are used in an implementation of a spoken language system that can operate in real time
on a commercially available workstation.


6.2.3 Suboptimality of Viterbi Scoring


There have been several other algorithms proposed for finding multiple sentence hypotheses.
One algorithm ([95, 62]) is quite similar, with some important implementational
differences, to the Lattice N-Best algorithm. This algorithm, as described above, suffers
severely from the problem that it only considers alternate sequences that diverge from the
exact boundaries found for the higher scoring paths. The second algorithm, proposed in
[91], is the Tree-Trellis algorithm. This algorithm starts with the same forward pass as
used in the Forward-Backward Search. However, it uses a stack search in the backwards
direction to find the N-Best word sequences. It must rescore each hypothesis explicitly in
the backwards pass in order to guarantee finding the best sentences. Thus the computation
needed is proportional to N, which may be a problem for large N. (The Word-Dependent
algorithm requires computation that is proportional to the number of local hypotheses only,
which is much smaller than N.)


Figure 6.7: Word-Dependent Algorithm using Viterbi scoring vs forward-likelihood scoring.
At a rank of 100, Viterbi scoring misses twice as many sentences. (Horizontal axis: rank of
correct answer.)

Both of the algorithms mentioned above are based on Viterbi scoring. That is, they
find the word sequence corresponding to the most likely single state sequence. However,
we should sum the probability over all possible state sequences that correspond to a word
sequence. Figure 6.7 shows a cumulative distribution of the rank of the correct answer
for the two types of scoring within the Word-Dependent N-Best algorithm. At all ranks,
there is a 5% increase in the likelihood of finding the correct sentence when we sum the
probability over state sequences. While this difference may not be important for rank 1,
where the sentence error rate is 38%, it represents a factor of two in the error rate at a
rank of 100. Thus, we feel that it is essential, within an N-Best search, to use the total
likelihood score for the word sequence.
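
The difference between the two kinds of scoring can be seen on a toy HMM. The sketch below is an illustration only: the function name `sequence_score`, the two-state model and its probabilities are made up, and the same recursion is run once with `max` (Viterbi, best single state sequence) and once with `sum` (forward, total likelihood over all state sequences).

```python
def sequence_score(init, trans, emit, obs, combine):
    """Score an observation sequence under a small HMM.

    combine=max gives the Viterbi score (most likely single state sequence);
    combine=sum gives the forward score (sum over all state sequences).
    init[s], trans[r][s] and emit[s][o] are plain probabilities.
    """
    states = range(len(init))
    scores = [init[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        scores = [
            combine(scores[r] * trans[r][s] for r in states) * emit[s][o]
            for s in states
        ]
    return combine(scores)


# Hypothetical two-state left-to-right model with uninformative emissions.
toy_init = [1.0, 0.0]
toy_trans = [[0.5, 0.5], [0.0, 1.0]]
toy_emit = [[0.5, 0.5], [0.5, 0.5]]
toy_obs = [0, 1, 0]

viterbi_score = sequence_score(toy_init, toy_trans, toy_emit, toy_obs, max)
forward_score = sequence_score(toy_init, toy_trans, toy_emit, toy_obs, sum)
```

In this toy case three state sequences explain the observations, and the best one carries only half of the total probability, so ranking hypotheses by the Viterbi score can order them differently from ranking by total likelihood.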


6.2.4 Summary


In summary, we have considered several approximations to the exact Sentence-Dependent
N-Best algorithm, and evaluated them thoroughly. We show that an approximation that
only separates theories when the previous words are different allows a significant reduction
in computation, makes the algorithm scalable to long sentences and less susceptible to
pruning errors, and does not increase the search errors measurably. In contrast, the Lattice
N-Best algorithm, which is still less expensive, appears to miss twice as many sentences
within the N-Best choices.


6.3  New  Uses  for  the  N-Best  Sentence  Hypotheses  Within 
the  BYBLOS  Speech  Recognition  System 


Since developing the N-Best Paradigm for the integration of speech with NL, we have
found many other uses for it. Below, we describe four different ways in which we use the
N-Best paradigm within the BYBLOS system. The most obvious use is for the efficient
integration of speech recognition with a linguistic natural language understanding module.
However, we have extended this principle to several other acoustic knowledge sources.
We also describe a simple and efficient means for investigating and incorporating arbitrary
new knowledge sources. The N-Best hypotheses are used to provide close alternatives
for discriminative training. Finally, we have developed a simple technique that allows
us to optimize several weights and parameters within a system in a way that directly
minimizes word error rate. These techniques were invaluable in our experimental work
on the ATIS domain, in which we would like to use powerful knowledge sources such as
trigram statistical models of language.

While most of these uses are not search strategies, they are all related in that we use
the N-Best process to provide a reduced set of choices that require special attention. In the
case of a search strategy we can apply additional KSs to make the final choice, while in
the case of discriminative training, we modify the parameters of our model specifically to
reject those wrong answers that our system is likely to choose. In the remaining sections
we discuss each of the different uses of the N-Best paradigm within the BYBLOS system.


6.3.1 Efficient Search Strategies


The initial use of the N-Best paradigm to integrate speech recognition and natural language
within a spoken language system is an example of a search strategy optimization. That
is, we reduce the number of choices using an inexpensive acoustic model and statistical
language model before applying the more expensive linguistic understanding model to
verify that a sentence hypothesis is meaningful. Given that a linguistic natural language
model is typically not much more powerful than a simple statistical n-gram language model,
yet is much more complex, we find it more cost effective to use the statistical n-gram model
in an N-Best algorithm to generate all the likely sentence hypotheses, and then apply the
linguistic language model to the resulting alternatives. That is, we use the N-Best algorithm
as a sentence-level "fast match". The question, then, is whether the accuracy decreases
when using this strategy, since it is not admissible. However, we showed in [32] that,
given powerful acoustic and language models, the correct sentence is found within the set
of proposed sentence hypotheses a very high percentage of the time. In subsequent work,
we found that even in those cases where the correct answer was absent, there was almost
always a higher scoring answer that was acceptable to the NL module. This means that a
more tightly coupled search could not have resulted in lower error rate.

We have extended the basic idea outlined above to several other expensive knowledge
sources that we use in our continuous speech recognition system. Some examples of
these are: higher-order statistical language models, cross-word coarticulation models,
semi-continuous tied-mixture models, and whole-segment acoustic models. Each of these models
can result in higher recognition accuracy, but at a significant increase in computational cost
and complexity. By applying them only to the N-Best hypotheses, we are able to get the
benefits of these more detailed models at a fraction of the cost.


High-Order  Language  Models 


Time-synchronous speech recognition using a bigram statistical language model is quite
efficient. However, if sufficient text training is available, we can derive a more powerful
language model of sequences of three or more words or word classes. But the computation
needed to use a trigram language model in a time-synchronous search is proportional to the
square of the number of words or classes, making it impractical. Instead, we use a bigram
model (of word classes) when we find the N-Best hypotheses, and then compute, for
each text hypothesis, the language model probability using a 3-gram or 4-gram model.
The computation and substitution of the higher-order language model score for a particular
known sequence of words and subsequent recombination with acoustic score requires almost
no computation, and can thus be accomplished with no perceivable delay. Again, there
is the risk that we will make a search error by using the weaker language model to filter
out most of the choices. For example, the perplexity of the higher-order language model
might be 30% less than the perplexity of the bigram model. In principle, this means that
the average number of sentences predicted by the bigram model may be a few orders of
magnitude more than predicted by the trigram model, implying that we would need to have
N-Best lists that include thousands or millions of sentences to ensure having the optimal
answer. However, this analysis ignores the fact that the added linguistic information is
smaller when used together with an acoustic model, as in speech recognition. (In other
words, the increase in mutual information is small.) Thus, we find empirically that, as long
as the utterances are not too long, the highest scoring sentence using a trigram model is
never past rank 100 when using a bigram language model. (We divide very long utterances
into sections and find the N-Best sentences for each section.)
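
A rescoring pass of this kind can be sketched as follows. This is a minimal illustration under assumptions, not the BYBLOS code: the N-Best list is assumed to be a list of (words, acoustic log score) pairs, the trigram model is assumed to be a dictionary from word triples to probabilities, and the names `trigram_logprob`, `rescore_nbest`, the sentence-boundary tokens and the probability floor for unseen trigrams are all hypothetical.

```python
import math


def trigram_logprob(words, trigram_probs, floor=1e-6):
    """Log probability of a known word sequence under a trigram model.

    trigram_probs maps (w1, w2, w3) to P(w3 | w1, w2). Unseen trigrams
    get a small floor probability (an assumption; a real model would
    back off or smooth instead).
    """
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    return sum(
        math.log(trigram_probs.get(tuple(padded[i - 2 : i + 1]), floor))
        for i in range(2, len(padded))
    )


def rescore_nbest(nbest, trigram_probs, lm_weight=1.0):
    """Replace the first-pass language model score with a trigram score
    and re-sort the hypotheses, best first."""
    rescored = [
        (acoustic + lm_weight * trigram_logprob(words, trigram_probs), words)
        for words, acoustic in nbest
    ]
    rescored.sort(key=lambda item: item[0], reverse=True)
    return rescored


# Hypothetical two-entry N-Best list and trigram table.
toy_nbest_list = [(["show", "flight"], -9.5), (["show", "flights"], -10.0)]
toy_trigrams = {
    ("<s>", "<s>", "show"): 0.5,
    ("<s>", "show", "flights"): 0.4,
    ("show", "flights", "</s>"): 0.5,
    ("<s>", "show", "flight"): 0.1,
    ("show", "flight", "</s>"): 0.1,
}
toy_rescored = rescore_nbest(toy_nbest_list, toy_trigrams)
```

Each hypothesis costs one table lookup per word, which is why the higher-order model adds almost no computation once the word sequences are fixed.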

This is not to say that we can always use any weak language model to find the N-Best
sentences. For example, if we use a model that simply allows any sequence of words,
we find that the correct answer is not in a list of 100 sentence alternatives a sufficiently
high percentage of the time. This is also true when we are dealing with severely degraded
speech, where the acoustic information is weaker. Thus, we must determine for each case
which language model should be used for finding the N-Best sentences. Luckily, it is easy
to determine during development when we have made a search error with this paradigm.
All we have to do is to (artificially) include the correct answer among the choices. If we
find that this improves the accuracy, then this tells us that we have made a search error by
using the weaker language model first.
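
The diagnostic described above can be sketched in a few lines. The function name `count_search_errors`, the representation of hypotheses as word tuples, and the `rescoring_score` callable (standing in for the full set of knowledge sources) are assumptions for illustration.

```python
def count_search_errors(utterances, rescoring_score):
    """Count utterances where the first pass lost the correct answer.

    utterances: list of (nbest, reference) pairs, where nbest is a list
    of word tuples and reference is the correct word tuple.
    A search error is counted when the reference is missing from the
    list but would outscore every listed hypothesis if it were added.
    """
    errors = 0
    for nbest, reference in utterances:
        if reference in nbest:
            continue  # the first pass kept the correct answer
        best_listed = max(rescoring_score(h) for h in nbest)
        if rescoring_score(reference) > best_listed:
            errors += 1
    return errors


# Hypothetical scores from the more detailed models.
toy_scores = {
    ("a", "b"): -1.0, ("a", "c"): -2.0,
    ("x", "y"): -5.0, ("x", "z"): -3.0,
    ("p",): -1.0, ("q",): -4.0,
}
toy_utterances = [
    ([("a", "b"), ("a", "c")], ("a", "b")),  # correct answer is in the list
    ([("x", "y")], ("x", "z")),              # missing and would have won
    ([("p",)], ("q",)),                      # missing but would not have won
]
toy_search_errors = count_search_errors(toy_utterances, toy_scores.get)
```

Only the second utterance counts: the third is a modeling error rather than a search error, since a longer list would not have changed the final choice.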


Cross-Word  Coarticulation  Models 

One of the improvements that has been demonstrated in recent years is the use of
context-dependent models that span the boundaries between words. It has been shown that modeling
acoustic coarticulation between words (cross-word models) reduces the recognition error
rate by a significant factor, typically 30%. While the principle is the same as modeling
coarticulation within words, there is a large difference in the cost. Since the phonetic
pronunciations of a word are essentially a linear sequence with few alternatives, we can
replace each phoneme model within a word with the appropriate model that depends on
the preceding and following phoneme without significantly changing the topology of the
model of the word. However, if we include models that depend on neighboring words, this
interacts with the choices in the grammar. In order to use cross-word models, we must
compile a triphone network grammar that has a large number of initial and final acoustic
models for each word and correctly incorporates the constraints depending on the preceding
and following words. This grammar and the resulting computation are quite large.

Again, we can use the simpler models (that model within-word coarticulation only)
to compute the N-Best hypotheses and then rescore each hypothesis using cross-word
models. To do this we must compile a "grammar" for each sentence hypothesis that
incorporates the cross-word models for that sentence, and then use this grammar to compute
the probability of the acoustics (i.e. compute the forward probability) independently. Since
each hypothesis is a known linear sequence of words, the resulting grammar is quite simple
and requires very little computation to evaluate the correct score.


Semi-Continuous Density HMM Models


HMMs that use semi-continuous [45] or tied-mixture densities [16] have been shown to
result in slightly less error than the corresponding discrete density HMMs. Semi-continuous
density models are like discrete models, except that rather than representing the input with
the index of the closest prototype vector, we preserve several (e.g. the 5) most likely
regions of the input space, along with their probabilities. This amounts to a smoothing
of the probability density at a state between regions in the input space. However, the
computation for evaluating the spectral observation probabilities increases severalfold
(usually by a factor of 5 to 10), resulting in a total increase in computation by a factor of 2 or
more. Consequently, we have considered several alternatives to using the semi-continuous
densities in all stages. First, we found that if we just use semi-continuous densities during
the training, but use the discrete (top 1) model during recognition, we obtain most of the
gain attributed to semi-continuous densities. In order to gain the remaining error reduction,
we include the top 5 VQ bins during the rescoring pass. Although the semi-continuous
density computation in the rescoring pass is also more expensive, this cost can be minimized
because the different sentence hypotheses in the N alternatives typically only span a small
fraction of the phonetic models. Therefore, when we rescore each of these alternatives, the
extra cost for using semi-continuous densities can be made small.
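
The contrast between the discrete (top 1) and semi-continuous (top 5) observation probabilities can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, `state_weights[k]` is assumed to be the state's discrete probability for codebook region k, `codeword_likelihoods[k]` is assumed to be the likelihood of the input frame under region k, and normalizing the retained likelihoods so they sum to one is one common convention, not something the text specifies.

```python
def discrete_obs_prob(state_weights, codeword_likelihoods):
    """Discrete density: use only the single closest (most likely) region."""
    best = max(range(len(codeword_likelihoods)), key=codeword_likelihoods.__getitem__)
    return state_weights[best]


def semi_continuous_obs_prob(state_weights, codeword_likelihoods, top=5):
    """Semi-continuous density: blend the state's weights over the `top`
    most likely regions, weighted by their normalized likelihoods."""
    ranked = sorted(
        range(len(codeword_likelihoods)),
        key=codeword_likelihoods.__getitem__,
        reverse=True,
    )[:top]
    norm = sum(codeword_likelihoods[k] for k in ranked)
    return sum(state_weights[k] * codeword_likelihoods[k] / norm for k in ranked)


# Hypothetical four-region codebook and one state's weights.
toy_state_weights = [0.5, 0.25, 0.125, 0.125]
toy_likelihoods = [0.75, 0.25, 0.0, 0.0]

toy_discrete = discrete_obs_prob(toy_state_weights, toy_likelihoods)
toy_semi = semi_continuous_obs_prob(toy_state_weights, toy_likelihoods, top=2)
```

The semi-continuous value touches `top` weights per state and frame instead of one, which is the source of the extra cost that the rescoring pass confines to the few models appearing in the N alternatives.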


6.3.2  Investigating  and  Using  New  KSs 


Usually, in order to evaluate the utility of a new KS, it must be implemented in a reasonably
efficient recognition search. But this is not always easy. For example, a segmental model
of speech (that models whole segments as a single unit rather than as individual frames
like an HMM) requires a tremendous amount of computation, making the research with
this new model difficult. A model of sentential stress may require relative measurements
of pitch, loudness, and phoneme duration over long intervals in a sentence, and may also
require relating these measurements to a syntactic grammar. Often, the new KS is not
expected to perform the whole recognition problem by itself. Often, the issue of what
information the KS provides and the issue of how to use it in an integrated system are
conflated.

The first question we want to ask about a new KS is not what recognition accuracy
results from using this KS by itself or how it interacts in a search strategy, but how much
information it adds to the KSs that are already being used. Only after we determine that
it is useful do we want to consider the most efficient way to use it. The problem of
evaluating a new KS is made simple using the N-Best paradigm. The only requirement
for evaluating (or using) a new KS is that there be a way to produce a score for a given
sentence hypothesis. The scoring mechanism does not need to be able to operate in a
left-to-right manner or on partial sentences. First, we start with a long list of the N-Best
hypotheses for each of the utterances in a development test set. (If the correct answer is
not included in the list, we can artificially include it in order that the new KS will have a
chance to improve its score sufficiently to raise it above all the other hypotheses.) Then
we compute the score for each hypothesis using the new KS. We find the optimal linear
weights for combining the score of the new KS with those of the previous KSs using
the technique described in Section 6.3.4. The decrease in word or sentence error rate, if
any, tells us how useful this new KS is in the context of the total system.

If  the  new  KS  is  found  to  be  useful,  then  the  same  method  provides  a  simple  and 
efficient  way  to  integrate  it  into  the  system.  Of  course,  if  this  new  KS  is  powerful  enough, 
and  not  expensive,  we  may  want  to  consider  ways  to  incorporate  it  at  an  earlier  stage. 
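
The evaluation procedure for a new KS can be sketched as follows. This is a simplified illustration, not the method of the report: a plain grid search over one weight stands in for the weight optimization technique, and the names `best_hypothesis`, `sentence_error_rate`, `evaluate_new_ks` and the toy development set are hypothetical. Each hypothesis is assumed to carry one log score per KS, with the new KS last.

```python
def best_hypothesis(hyps, weights):
    """Pick the hypothesis with the highest weighted sum of KS log scores.

    hyps: list of (words, scores) where scores has one log score per KS.
    """
    return max(hyps, key=lambda h: sum(w * s for w, s in zip(weights, h[1])))[0]


def sentence_error_rate(dev_set, weights):
    """Fraction of utterances whose top-scoring hypothesis is not the reference."""
    wrong = sum(1 for hyps, ref in dev_set if best_hypothesis(hyps, weights) != ref)
    return wrong / len(dev_set)


def evaluate_new_ks(dev_set, base_weights, candidate_weights):
    """Compare the error rate without the new KS (weight 0) against the
    best error rate found over a grid of weights for the new KS."""
    baseline = sentence_error_rate(dev_set, list(base_weights) + [0.0])
    with_new = min(
        sentence_error_rate(dev_set, list(base_weights) + [w])
        for w in candidate_weights
    )
    return baseline, with_new


# Hypothetical development set: scores are [HMM, grammar, new KS].
toy_dev_set = [
    ([(("a",), [-1.0, -1.0, -5.0]), (("b",), [-1.5, -1.0, -1.0])], ("b",)),
    ([(("c",), [-1.0, -1.0, -1.0]), (("d",), [-3.0, -1.0, -1.0])], ("c",)),
]
toy_baseline, toy_with_new = evaluate_new_ks(
    toy_dev_set, base_weights=[1.0, 1.0], candidate_weights=[0.0, 0.5, 1.0]
)
```

Note that the new KS only has to score complete sentences; nothing in the procedure requires it to run left to right or inside the search.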


Segmental  Models 


We have used this paradigm to utilize segmental models for speech recognition. Two
examples of such models are the Stochastic Segment Model [70] and the Segmental Neural
Network (SNN) [7]. The advantage of a segmental model of speech is that it can take into
account dependency between frames within a phoneme. However, the computation required
for recognition is greatly increased. Even though we can use a dynamic programming
update to compute different path scores, we must explicitly consider different segment
durations for each possible ending time for a segment, resulting in an order of magnitude
more computation than required for typical frame-based HMM models. The computational
and search issues related to segmental models are avoided by using the N-Best paradigm,
and we can measure directly the improvement when added to our best HMM system.

One byproduct of rescoring the N-Best alternatives with the HMM is the best phoneme
sequence along with the best corresponding segmentation for each sentence hypothesis.
Each of these hypotheses can be rescored quickly using the segment model, since there
is no ambiguity as to the segmentation. If we allow the segmentation to be different for
the segmental model than for the HMM, we can speed up the search tremendously by
constraining the segmentation to be close to that of the HMM.

This approach has made it possible to consider using models that would otherwise
be too expensive. While the recognition accuracy with the two new segmental models
was not, by itself, superior to that of the HMM, we found that when we combined the
new scores with the previous scores, we were able to obtain better performance than with
either model alone. Specifically, in the case of the SNN, we performed experiments on
the DARPA 1000-Word Resource Management Corpus. We were able to reduce the word
error rate on the October '89 speaker-independent test set from 4.1% to 3.0% by using the
combined HMM and SNN scores [7]. This represents a substantially lower error rate than
previously reported on this corpus.


6.3.3 Discriminative Training


Several methods for discriminative training have been proposed. These methods generally
need a mechanism for creating errors and near misses in order that the correct answers
can be made more likely, while the errors and all of the near misses can be made less
likely. Usually, ad hoc techniques are used to derive a list of near misses for each word
independently. Instead, we compute the N-Best sentence hypotheses, and rescore them
using all of the KSs in the system. This means that any errors that show up in the final
list are real errors that the system is likely to make, and we should attempt to remove
them by discriminative training. Conversely, if a particular error is unlikely, e.g. because
the grammar makes it unlikely, we won't observe it and we don't need to spend modeling
effort to avoid that error. We have reported on using this method for MMI training [30]
of the codebook weights in a multiple codebook discrete density HMM system.

This method has been most successful within the context of the Segmental Neural
Networks (SNN). In our initial implementation of SNN we trained the networks on speech
that was phonetically segmented using trained HMM models. Each segmented phoneme
served as a positive example of its own model and a negative example for all of the other
models in the usual manner of neural network training. There are two problems with this
approach. First, the SNN is required to model all of the distinctions between each phoneme
and all of its neighbors. But this job is already being done quite well by the HMM. Second,
the SNN is only trained on correct segmentations. But we know that in practice, almost all
recognition errors also result in a different phonetic segmentation than that of the correct
answer. To be effective as a discriminator, the SNN must learn to reject these incorrect
segmentations. However, rather than training the segmental neural networks on all possible
segmentations and labelings, we would like it to concentrate all of its modeling capability
on eliminating those few errors made by the HMM.

First, we use the correct labeling and segmentation as provided by the HMM models
as examples of the correct models. Then, we use the N-Best list (rescored with detailed
HMMs and language models) as examples of likely misrecognitions by the HMM. We
determine, for each incorrect hypothesis, which phonemes differed from the correct answer
in labeling and/or segmentation. The difference usually only consists of a few phonemes in
each hypothesis. These incorrect phonemes are used as negative examples for training the
SNN. The result of this training method is that the models are better able to complement
the HMM models as an additional KS.
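
Selecting the negative examples amounts to a difference between two labeled segmentations. The sketch below assumes each hypothesis is available as a list of (phoneme, start frame, end frame) tuples; the function name and the toy segmentations are hypothetical.

```python
def negative_examples(correct_segments, hypothesis_segments):
    """Segments of an incorrect hypothesis that differ from the correct
    answer in label and/or boundaries.

    Each segment is a (phoneme, start_frame, end_frame) tuple. Segments
    that match the correct answer exactly are not negative examples.
    """
    correct = set(correct_segments)
    return [seg for seg in hypothesis_segments if seg not in correct]


# Hypothetical segmentations of one utterance.
toy_correct = [("s", 0, 10), ("ih", 10, 18), ("t", 18, 25)]
toy_hypothesis = [("s", 0, 10), ("eh", 10, 17), ("t", 17, 25)]
toy_negatives = negative_examples(toy_correct, toy_hypothesis)
```

Here the final segment has the right label but the wrong boundaries, so it is still a negative example; this is how the network is exposed to the incorrect segmentations that accompany real recognition errors.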


6.3.4 Optimization of System Parameters


In addition to the large number of acoustic and language model parameters in a recognition
system, there are several system parameters that must be tuned for optimal performance.
Many of these cannot be estimated directly using the same techniques (e.g. maximum
likelihood). Some examples of these parameters are: word and phoneme insertion penalties,
the grammar weight, the codebook weights, and the weights of alternate acoustic models.
The word insertion penalty is an additional probability that we multiply by for each
transition to a new word (in addition to the grammar probability). This is used to control the
balance between insertions and deletions. The language model weight is an exponent on
each grammar probability that allows us to obtain the best balance between the acoustic
model scores and the language model scores. Finally, each different acoustic model is
weighted by exponentiating the probabilities according to their relative power.

Clearly we would like to set these parameters to optimize recognition accuracy directly.
However, maximum likelihood estimation techniques cannot be used to estimate these
exponent parameters. Therefore, we typically run several recognition experiments, each
requiring a few hours, to try to find the best system parameters. However, this tuning
often requires extensive experience and too many experiments.

The total probability of a sentence hypothesis can be expressed as the product of the
exponentiated probabilities of each KS.

\[
\text{Utt-score} = \text{HMMScore}^{\alpha} \times \text{GrammarScore}^{\beta} \times \text{WordPenalty}^{\#\text{words}} \times \text{PhonePenalty}^{\#\text{phones}}
\]

The unknown values are the exponents $\alpha$ and $\beta$, and the WordPenalty and PhonePenalty.
If we take the log, we have

\[
\log \text{Utt-score} = \alpha \log \text{HMMScore} + \beta \log \text{GrammarScore} + \#\text{words} \cdot \log \text{WordPenalty} + \#\text{phones} \cdot \log \text{PhonePenalty}
\]

Now, the unknown values on the right are just linear weights on the scores of the KSs.
Admittedly the number of words and phones are simple KSs, but we find that including
these terms significantly improves recognition accuracy. We need to find the four values
that minimize the error rate. While minimizing error rate directly for continuous speech
is usually difficult, it becomes easy if we change the problem to one of minimizing the
error rate when choosing among the N-Best alternatives for an utterance. First, we find
the N-Best hypotheses for all of the utterances in a development test set. The rescoring
step provides the log probabilities for each hypothesis for each KS separately. We use a
gradient search to find the set of weights that, averaged over the development set, brings
to the top the answer with the smallest number of errors. To evaluate a particular set of
weights we compute the total weighted log score for each hypothesis (the dot product of the
weights and scores), and then find the hypothesis with the maximum total score for each
utterance. We measure the word error rate for this top choice for each utterance in the set
in the usual way. The total word error rate over the set is our evaluation function for this
set of weights. The computation needed to evaluate a set of weights for 100 hypotheses
for 300 test utterances can be measured in milliseconds. Therefore we can consider several
thousand weight vectors in a few seconds in our search for the set of weights that minimizes
word error rate on the development set. As long as the development test set contains enough
utterances (say 300), we find that the weights found are also good for new test sets.
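The evaluation function described above is cheap enough to sketch directly. The example below is an illustration only. The toy development set and the names `total_errors` and `search_weights` are hypothetical. The search is a simple random local search, swapped in here for brevity, and is not the gradient search used in the report. The four weights play the roles of $\alpha$, $\beta$, log WordPenalty, and log PhonePenalty.

```python
import random
from typing import List, Sequence, Tuple

# (KS scores [log HMM, log grammar, #words, #phones], number of word errors)
Hypothesis = Tuple[Sequence[float], int]


def total_errors(weights: Sequence[float], nbest_lists: List[List[Hypothesis]]) -> int:
    """Word errors of the top-scoring hypothesis, summed over all utterances."""
    errors = 0
    for hypotheses in nbest_lists:
        # Total score is the dot product of the weights and the KS scores.
        top = max(hypotheses, key=lambda h: sum(w * s for w, s in zip(weights, h[0])))
        errors += top[1]
    return errors


def search_weights(nbest_lists, start, trials=2000, step=0.5, seed=0):
    """Random local search: keep a perturbed weight vector if it lowers the error."""
    rng = random.Random(seed)
    best_w, best_err = list(start), total_errors(start, nbest_lists)
    for _ in range(trials):
        candidate = [w + rng.uniform(-step, step) for w in best_w]
        err = total_errors(candidate, nbest_lists)
        if best_err > err:
            best_w, best_err = candidate, err
    return best_w, best_err


# Toy development set: two utterances with two hypotheses each.
dev_set = [
    [([-100.0, -20.0, 5, 15], 2), ([-102.0, -15.0, 4, 14], 0)],
    [([-80.0, -10.0, 3, 9], 0), ([-79.0, -14.0, 4, 10], 1)],
]
acoustic_only = [1.0, 0.0, 0.0, 0.0]
print(total_errors(acoustic_only, dev_set))
tuned_w, tuned_err = search_weights(dev_set, acoustic_only)
print(tuned_err)
```

Because no recognition pass is rerun, each call to `total_errors` only reranks stored scores, which is why thousands of weight vectors can be tried quickly. A real evaluation would divide the error count by the number of reference words to obtain a word error rate.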


6.3.5  Summary 

We  have  described  several  new  uses  for  the  N-Best  paradigm.  As  originally  intended, 
it  is  useful  for  decreasing  computation  when  using  several  different  knowledge  sources 
for  speech  recognition.  But  it  has  also  been  shown  to  be  quite  useful  for  discriminative 
training  of  several  different  types,  and  for  directly  optimizing  final  system  performance, 
which  previously  required  an  exhausting  number  of  experiments. 


6.4  Integration  with  Natural  Language 


We experimented with several conditions to optimize the connection of BYBLOS with
DELPHI. The basic interface between speech and natural language in HARC is the N-best
list. In our initial implementation of this integration strategy, we simply allowed the natural
language component to search arbitrarily far down the N-best list until it either found a
hypothesis that produced a database retrieval or reached the end of the N-best list. Over
the course of the rest of this project we explored the nature of this connection in more
detail. The parameters we varied were:

• the depth of the search that NL performed on the N-best output of speech

• the processing strategy used by NL on the speech output


In our examination of the data from the February, 1991 cross-site evaluation, we had
noticed that while it was beneficial for NL to look beyond the first hypothesis in an N-best
list, the answers obtained by NL from speech output tended to degrade the further down in
the N-best list they were obtained. Subsequently, we performed a number of experiments
to determine the break-even point for NL search. We used an N of 1, 5, 10, and 20 in our
experiments.

During our most recent development work, we utilized the fall-back strategies for NL
text processing described in Section 4.3. In applying these fall-back strategies to speech output, we
examined the trade-off between processing speech output with a more restrictive scheme,
and thereby potentially discarding meaningful utterances, vs. processing speech output with
a more forgiving strategy, and thereby potentially allowing in meaningless or misleading
utterances. We experimented with three processing strategies:


•  fallback  processing  turned  off 

•  fallback  processing  turned  on 

• a combined strategy, in which an initial pass was made with fallback processing
turned off. If no hypothesis produced a database retrieval, a second pass was made,
with the fallback strategy engaged.
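The combined (two-pass) strategy can be sketched as follows. This is a minimal illustration, not DELPHI's interface: `two_pass_search`, the `parse` callback, and the toy parser are hypothetical stand-ins, and the default depth of 5 is simply one of the N values explored in the experiments.

```python
from typing import Callable, List, Optional


def two_pass_search(nbest: List[str],
                    parse: Callable[[str, bool], Optional[str]],
                    depth: int = 5) -> Optional[str]:
    """Search the top `depth` hypotheses with fallback off, then again with fallback on."""
    for use_fallback in (False, True):
        for hypothesis in nbest[:depth]:
            retrieval = parse(hypothesis, use_fallback)
            if retrieval is not None:
                return retrieval
    return None


def toy_parse(text: str, use_fallback: bool) -> Optional[str]:
    """Hypothetical NL component: strict parsing accepts one sentence only,
    fallback parsing accepts anything that mentions the destination."""
    if text == "show flights to boston":
        return "QUERY(flights, dest=boston)"
    if use_fallback and "boston" in text:
        return "QUERY(flights, dest=boston) [fallback]"
    return None


nbest = ["show flight to boston uh", "show flights to boston", "so flights to austin"]
print(two_pass_search(nbest, toy_parse))
print(two_pass_search(nbest, toy_parse, depth=1))
```

With the full list, the strict first pass finds the second hypothesis. With `depth=1`, the first pass fails and the more forgiving second pass accepts the top hypothesis.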


We show the results of one such experiment, utilizing the October, 1991 dry-run corpus
as development test, in Table 6.1.

The results of our experiments indicated that an N of 5 was optimal, and that the
two-pass processing strategy was slightly better than either of the others. This was the
configuration we used on the February, 1992 cross-site evaluation data.

Condition       N      WE
Text            (1)    47.9
Fallback on     1      64.6
Fallback on     5      58.0
Fallback on     20     60.1
Fallback off    1      64.2
Fallback off    5      56.9
Fallback off    20     59.0
Two Pass        5      56.6

Table 6.1: SLS weighted error (WE) on the October ’91 dry-run test set with varying
N-best list length (N).

We repeat here our results on the February, 1992 cross-site evaluation data, originally
presented in Table 4.3.


Condition         %T            %F            %NA     %WE
syn only          [illegible]   [illegible]   19.7    42.6
frame only        [garbled: only the values 13.2 and 39.2 (%WE) survive]
both (official)   71.8          15.4          12.8    43.7
both (adjusted)   71.[?]        15.1          13.0    43.2

Table 6.2: SLS Results, February, 1992


Chapter  7 


Real-Time  Speech  Recognition 


One of the major goals of this contract has been the development of real-time spoken
language systems. It has long been believed that the computation required for speech
recognition and understanding is beyond the capability of general purpose computers. The
search space for speech recognition alone is large. In addition, if we include a natural
language model, the search space for speech becomes much more complex, and the
natural language computation can be many orders of magnitude more than that required for text
processing.

The traditional approach to solving this problem has been the development of special
purpose hardware, or the use of large parallel processor systems. Unfortunately, both of
these approaches present new problems of their own. In addition, the speed advantages
from these approaches may be no more than a factor of 10. Frequently, before these
hardware efforts finish, the speed of the general purpose machines has increased to the
point where there would be no advantage for the special purpose hardware.

We have come to believe that the most profitable approach to the problem is to think
about algorithms that give the same accuracy with a small fraction of the computation. We
believe this because we have been successful in developing many such approaches which,
taken together, save several orders of magnitude in computation with no loss in accuracy.
Consequently, our goal has been to use readily available workstations for the computation
engines, and obtain increased speed from a combination of new search algorithms and the
natural evolution in hardware speed.

In this chapter, we describe some of the algorithms that we have developed to enable
real-time spoken language understanding. In particular, these are the Forward-Backward
Search Algorithm and an efficient algorithm for decoding with statistical n-gram grammars.
We also demonstrate convincingly that, when searching for the most likely sequence of
words, it is better to add the probabilities from different state sequences for a given word
sequence (the Forward Probability Search) rather than simply to find the word sequence
corresponding to the most likely state sequence (the Viterbi algorithm). We also describe
the hardware that we used, and review some of the demonstrations that we have given
along the way.
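The difference between adding the probabilities of state sequences and keeping only the best one can be seen in a toy case. The numbers below are invented for illustration and do not come from the report: one word sequence has a single dominant state sequence, while a competitor spreads more total probability over several state sequences.

```python
# Probabilities of the individual state sequences (paths) for each word
# sequence. These values are hypothetical.
paths = {
    "ADD AN AREA": [0.30],
    "ADD IN AREA": [0.20, 0.20, 0.20],
}

# Viterbi-style decision: score a word sequence by its single best path.
viterbi_choice = max(paths, key=lambda words: max(paths[words]))

# Forward-probability decision: score a word sequence by the sum over its paths.
forward_choice = max(paths, key=lambda words: sum(paths[words]))

print(viterbi_choice)
print(forward_choice)
```

The two criteria disagree here: the best single path belongs to the first word sequence, but the second word sequence has the larger total probability, so only the summed score identifies the most likely word sequence.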


7.1  Forward-Backward  Search  Algorithm 


Despite the advances in the speed of the computation of the N-Best sentences, it still requires
more computation than we would like to spend. We have developed a general technique
that greatly speeds up expensive time-synchronous beam searches in speech recognition.
The algorithm is called the Forward-Backward Search and is mathematically related to
the Baum-Welch forward-backward training algorithm. It uses a simplified forward pass
followed by a detailed backward search. The information stored in the forward pass is used
to decrease the computation in the backward pass by a large factor. We have observed an
increase in speed of a factor of 40 with no increase in search errors.


7.1.1  Introduction 

As speech recognition algorithms advance in complexity, so does the amount of computing
power they require. At some point we start looking for ways to make them run faster, both
because we want to run experiments in a reasonable amount of time and also because we
want to incorporate these new algorithms into real-time speech understanding systems. The
Forward-Backward Search (FBS) can be used to obtain increases in speed of up to a factor
of 40 in time-synchronous algorithms by using information obtained in a fast, simplified algorithm.
It is so called because of its similarities to the Baum-Welch forward-backward training
algorithm.

First, we briefly describe how fast-match algorithms work and identify some trade-off
problems in their implementation. Next, we describe the Forward-Backward Search,
and Subsection 7.1.4 describes some considerations in its implementation. Finally, we describe
how we use the FBS in our real-time speech understanding system.


7.1.2 Fast-Match Algorithms


A time-synchronous HMM CSR system works by taking each frame of an utterance and
simultaneously matching that frame to a set of HMMs. In order to limit the amount of work
performed by the recognizer, it is customary to apply a pruning beamwidth or threshold to
the scores generated by the HMMs, excluding from the match any HMM whose score falls
below the threshold. HMMs which are included in the match are described as “activated”
and those that are excluded are “deactivated”.

In addition to the mechanism for deactivating HMMs provided by the pruning beamwidth,
we need a method for activating HMMs so that, as the utterance progresses, the HMMs
corresponding to the new portions of the speech may be brought into the match. This
is usually done with a grammar. The grammar takes the final state score of an active
HMM and sets the score of the initial state of each possible following HMM to the final
score multiplied by a grammar score. If the score of this new HMM is above the pruning
threshold, the HMM is made active and henceforth included in the match until deactivated.
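The deactivation and activation steps for one frame can be sketched in the log domain, where multiplying by a grammar score becomes an addition. This is a simplified illustration: the function name, the use of a single score per word, and the toy vocabulary are assumptions, and the beam is taken relative to the best active score.

```python
import math
from typing import Dict, Tuple


def prune_and_activate(active: Dict[str, float],
                       grammar: Dict[Tuple[str, str], float],
                       log_beam: float):
    """One frame of beam bookkeeping.

    active   maps each active word to the log score of its final state.
    grammar  maps (previous word, next word) to a log grammar score.
    log_beam is the (negative) log beamwidth relative to the best score.
    """
    threshold = max(active.values()) + log_beam
    # Deactivate every word whose score falls below the threshold.
    survivors = {w: s for w, s in active.items() if s >= threshold}
    # Offer each following word the final score plus the grammar score.
    offered: Dict[str, float] = {}
    for (prev, nxt), log_p in grammar.items():
        if prev in survivors:
            score = survivors[prev] + log_p
            if score > offered.get(nxt, -math.inf):
                offered[nxt] = score
    # Activate only the followers that are themselves inside the beam.
    activated = {w: s for w, s in offered.items() if s >= threshold}
    return survivors, activated


active = {"show": -10.0, "so": -25.0}
grammar = {("show", "flights"): -1.0, ("show", "fights"): -15.0, ("so", "flights"): -0.5}
survivors, activated = prune_and_activate(active, grammar, log_beam=-12.0)
print(survivors, activated)
```

Note that this decision looks only at scores up to the present frame, which is the limitation the rest of this section addresses.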

Since the amount of work performed by the recognizer is dependent upon the number of
active HMMs and since this strategy pays no attention to the utterance beyond the present
frame, many models are made active, only to be deactivated a number of frames later,
having done no useful work. In addition, even if the word ends with a good score, there
may not be any path to the end of the utterance that is both grammatically and acoustically
correct. Since it may take many frames of the utterance for the score of such an activated
HMM to fall below the pruning threshold, most of the active models in a CSR system may
be in this situation and consequently most of the work performed in matching these HMMs
to the utterance will ultimately be wasted. A fast-match algorithm could be used to reduce
the number of these needlessly activated models by looking at the next few frames of the
utterance, but it would be better if we had some way of predicting whether the activation
of a particular model at a particular point in an utterance is consistent with the rest of the
utterance.

The usual method of implementing a fast-match algorithm [11, 38] in a CSR system is
to use two recognition systems in tandem. The first system acts as a filter to prune out
words which have no chance of being recognized by the second, more detailed, system.
In a time-synchronous CSR system, the fast-match recognizer examines a section of speech
after the time which the second system has reached and performs
a coarse analysis of the speech signal. The first system analyzes this section of speech
and compares it to models of the initial portions of all the words in the vocabulary and
compiles a list of words which might start during the “lookahead” portion of speech. The
second, frame-synchronous, system operates in the usual manner (e.g. the Viterbi decoding
algorithm) except that it does not consider starting words not in the list compiled by the
fast-match algorithm.

Fast-match algorithms suffer from the disadvantage that they have to be tuned, i.e.
there are arbitrary parameters of the algorithms which have to be adjusted in order to
achieve a trade-off between the selectivity (which words are included in the list) and
the chance of a search error (i.e. a situation where a word that is actually starting is not
included in the list).

Since a fast-match algorithm works by examining a lookahead portion of an utterance
to determine whether a word is likely to begin, there is a trade-off in the length of the
lookahead region: in order to be able to omit as many words as possible from the search,
it is desirable that the fast-match algorithm look as far ahead as possible, but the more the
fast-match looks ahead, the more the detailed match is delayed. This is because when the
fast-match reaches the end of the utterance, the detailed match must be delayed by at least
the length of the lookahead region. Thus, typically, no attempt is made to confirm that the
speech beyond is consistent with the hypothesized word.

Also, as the fast-match algorithm has to work at high speed, the length of this lookahead
region must be quite small. This leads to a problem. For example, if a fast-match algorithm
chose a word on the basis of the first two phonemes, there would be no way of distinguishing
between “ADD” and “ADDITIONAL”. If the utterance was “... ADD AN AREA ...”, the
inclusion of “ADDITIONAL” would be wasteful, since it would result in a bad match
for the phonemes beyond the first two. Although a fast-match algorithm may reduce the
amount of work of the second system, its effectiveness is hampered both by its near-sighted
nature and by the fact that the acoustic match used in the fast-match criterion is inferior to
that used by the second system. Thus a fast-match system must either be cautious or make
mistakes.
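The near-sightedness of a short lookahead can be reproduced with a toy lexicon. Everything in this sketch is assumed for illustration: the function `fast_match`, the exact-match criterion on phoneme labels (a real fast match scores acoustics), and the simplified pronunciations, which deliberately give “ADD” and “ADDITIONAL” the same first two phonemes as in the example above.

```python
from typing import Dict, List


def fast_match(observed: List[str], lexicon: Dict[str, List[str]], lookahead: int) -> List[str]:
    """Return the words whose first `lookahead` phonemes match the observed phonemes."""
    candidates = []
    for word, phones in lexicon.items():
        n = min(lookahead, len(phones))
        if observed[:n] == phones[:n]:
            candidates.append(word)
    return sorted(candidates)


# Simplified, hypothetical pronunciations.
lexicon = {
    "ADD": ["AE", "D"],
    "ADDITIONAL": ["AE", "D", "IH", "SH", "AX", "N", "AX", "L"],
    "AN": ["AE", "N"],
    "AREA": ["EH", "R", "IY", "AX"],
}
# Phonemes for "... ADD AN AREA ...".
observed = ["AE", "D", "AE", "N", "EH", "R", "IY", "AX"]

print(fast_match(observed, lexicon, lookahead=2))
print(fast_match(observed, lexicon, lookahead=3))
```

With two phonemes of lookahead both words are proposed. One more phoneme removes “ADDITIONAL”, at the cost of delaying the detailed match further.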

The  FBS  avoids  these  trade-off  problems  while  also  achieving  increases  in  speeds  by 
a  factor  of  at  least  40. 


7.1.3 The Forward-Backward Search (FBS)


As its name suggests, the FBS takes place in two phases. The first phase performs a
fast time-synchronous search of the utterance in the forward direction. The backward
pass performs a more expensive search, processing the utterance in the reverse direction
and using information gathered by the forward pass. It is used when implementing a
complex algorithm, such as the N-best [85, 32] algorithm, which stands little chance of
being performed in real-time by itself.

The FBS speeds up such an algorithm, and it can be used when one of the following
cases applies.

• A simplified form of the algorithm is available. For example, the N-best search is an
expensive time-synchronous algorithm which produces the N most likely sentence
hypotheses for a given utterance. A simple version of this is the Viterbi decoding
algorithm, which only produces one hypothesis for an utterance, but which takes far
less time computationally.

• A simplified acoustic model is available. In the BYBLOS system, we use 5-state
triphone HMMs. Other CSR systems might use more complex acoustic models. A
simplification of this could be to use 1-state models.

• A simplified language model is used. We have performed experiments using both
very large statistical/finite-state grammars and rule-based grammars. By using the
FBS with a simplified grammar, we have been able to run experiments which would
otherwise have been impossible due to the large amount of grammar work involved.


In such cases it is possible to use information generated by the simplified recognition
strategy to greatly reduce the amount of time taken in a second pass of the algorithm.
For a real-time system, the implication of this is that the simplified forward pass can be
performed in real-time, and the backward pass can be performed in a very short time after
the utterance has finished. The delay involved in waiting for the backward pass is about the
same as the delay needed to implement a fast-match lookahead. For example, the N-best
algorithm is computationally expensive and takes more than 20 times real-time to compute
for the 1000 word resource management task. We can however perform a simple Viterbi
time-synchronous beam search in real-time and use information gathered from this to speed
up the N-best algorithm by about a factor of 40 to a point where it finishes with only a
short delay.

The key to the FBS is that while the forward search is performed, the scores of the
final state of each active HMM which represents a word ending are recorded at each frame
of the utterance together with a record of which HMMs were active. We will denote the
set of words which are active at time $t$ of the utterance as $\Omega^t$ and the scores of the final
states of each word $w$ in $\Omega^t$ as

$\alpha(w, t)$

We will also need the maximum score of all sentence hypotheses, i.e. the score of the
optimal path determined by the forward pass. We will call this $\alpha^T$.

After the simplified forward pass has been completed, the second, expensive algorithm
is performed in reverse; that is, the second pass starts by taking the final frame of the
utterance and works its way back, matching frames earlier in the utterance until it reaches
the start of the utterance. At each frame, $t$, of the utterance, the exit score of each active
HMM (which corresponds to the initial HMM state, because the HMMs are operating in
reverse) is worked back through the grammar to give an input score for each HMM
linked to it. We call such an HMM $w'$ and the score which the grammar has computed with
the aid of the previous HMM is called

$\beta(w', t)$

If $w'$ is not already active, we can use the FBS to decide whether it is worth making it
active. The first thing to note is that if $w'$ is not a member of $\Omega^t$ we can conclude, solely
on the basis of the portion of the utterance between the start and $t$, that all paths from $w'$
to the start of the utterance resulted in such low scores that they would be pruned from the
recognition. Therefore, if $w' \notin \Omega^t$ we do not make the HMM active.

This in itself results in a great saving since it drastically reduces the number of HMMs
active. The algorithm has effectively “looked into the future” to decide if the HMM has a
chance to feature prominently in the final recognition result. However, it is also possible
to check whether the scores from the beginning of the utterance to $t$ and from $t$ to the end
of the utterance together imply that all paths passing through $w'$ at $t$ will be pruned at some
time.

The maximum score for all sentence hypotheses passing through $w'$ at time $t$ is

$\alpha(w', t)\,\beta(w', t)$

and the maximum score for all sentence hypotheses is $\alpha^T$. Thus, if the quantity*

$\frac{\alpha(w', t)\,\beta(w', t)}{\alpha^T}$

falls below the pruning threshold, we can conclude that all paths that pass through the final
state of $w'$ at $t$ will eventually be pruned. If this is the case, again it is useless to activate
$w'$ at this point in the utterance.
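The activation tests described in this section can be written compactly in the log domain, where the products and the ratio become sums and a difference. The sketch below is illustrative only: the function name, the argument layout, and the toy scores are assumptions, and the usual beam test relative to the best backward score is included alongside the two FBS tests.

```python
from typing import Dict, Set, Tuple


def fbs_should_activate(word: str, t: int, log_beta: float, max_log_beta: float,
                        omega: Dict[int, Set[str]],
                        log_alpha: Dict[Tuple[str, int], float],
                        log_alpha_T: float, log_beam: float) -> bool:
    """Decide whether the backward pass should activate `word` at frame t."""
    # Usual beam test on the backward score alone.
    if not log_beta > max_log_beta + log_beam:
        return False
    # FBS test 1: the word ending must have been active in the forward pass.
    if word not in omega[t]:
        return False
    # FBS test 2: log(alpha * beta / alpha_T) must stay inside the beam.
    return log_alpha[(word, t)] + log_beta - log_alpha_T > log_beam


# Toy forward-pass records for one frame (hypothetical numbers).
omega = {3: {"a", "b", "c"}}
log_alpha = {("a", 3): -30.0, ("b", 3): -34.0, ("c", 3): -40.0}
log_alpha_T = -50.0
log_beam = -10.0
backward = {"a": -21.0, "b": -20.0, "c": -22.0, "d": -15.0}
best = max(backward.values())

decisions = {w: fbs_should_activate(w, 3, lb, best, omega, log_alpha, log_alpha_T, log_beam)
             for w, lb in backward.items()}
print(decisions)
```

In this toy frame, the word with the best backward score is rejected because it was never active in the forward pass, and one further word is rejected because its combined forward-backward score falls outside the beam.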

Figure 7.1 describes how the FBS works. It depicts a situation which could arise during
the backward pass of the FBS. The paths coming from the top left hand corner represent
the terminal scores collected from the forward pass and those from the top right represent
the scores in the backward pass which are offered to the terminal states of models a, b,
c and d. The distance from the top of the figure represents the logarithm of the score, so
the product of two numbers is represented by the sum of the distances from the top of
the figure. For this value of $t$, path d can immediately be eliminated since, although it is
currently the highest scoring path in the backward pass, its exclusion from

$\Omega^t = \{a, b, c\}$

indicates that it scores so badly in the portion of the utterance from $t$ down to the beginning
that it will be pruned. The elimination of the other three paths, a, b and c, depends on the
pruning beamwidth. Since

$\log\left(\alpha(w', t)\,\beta(w', t)\right)$

is the sum of the “depths” of the two paths at $t$, path c is most likely to be eliminated,
followed by b and finally a.

[Figure: paths labeled “forwards” and “backwards”]

Figure 7.1: Forward-Backward Search. Forward and backward scores for the same state
and time are added to predict final score for each theory extension.

* This quantity is very similar to the expression for the likelihood calculated as part of the Baum-Welch
forward-backward training algorithm (in this case we are calculating the maximum likelihood as opposed to
the total likelihood used in the training algorithm) and is the reason behind the naming of the FBS.

A consideration in using the FBS for CSR systems which use HMMs to model phonemes
(as opposed to whole words) is that the recording and checking of HMM scores can be
restricted to a subset of the HMMs. For instance, in CSR systems which use phoneme-based
HMMs which are concatenated to represent whole words it is more efficient to
only consider the phoneme HMMs which represent the ends of the words. Phoneme-based
HMM CSR systems often join up strings of phonemes to represent words. Since the
HMMs representing the end of the word will only have a significant score if the whole
word matches, few word-end HMMs will be active. In the backward pass many of the
word-end phonemes could become active, since the recognition is proceeding in reverse,
but the application of the FBS prevents this. Since the word-ends do not
become active in the backward pass, no other phoneme model in the word can be activated.


7.1.4  Implementation  of  the  Forward-Backward  Search 


Perhaps one of the most appealing features of the FBS is the ease with which it can be
implemented. The first task in the implementation is to make the second pass operate in
reverse. This can be done in two ways: the algorithm can be made to scan the utterance
in reverse and the state and grammar transitions can be traced backwards, or the utterance
itself can be reversed as well as the transitions in the HMMs and grammar, and the actual
algorithm may remain unchanged. When converting an existing algorithm to use the FBS,
the easiest way may be to use the second strategy, but for ease of notation, we will adopt
the first.

    t = T;
    while (t > 0)
        t = t - 1
        for each w in ActiveWords
            UpdateAllStatesWithinWord();
            X = GetSetOfAllBackwardLinkedHmms();
            for each w' in X
                β(w', t) = GetHmmScoreWithGrammar(w, w')
            end
            MaxBeta = Max(β(w', t))
            for each w' in X
                if (β(w', t) > PruningBeamwidth * MaxBeta AND
                    w' ∈ Ω^t AND                                      (boxed)
                    α(w', t) * β(w', t) / α^T > PruningBeamwidth)     (boxed)
                begin
                    AddHmmToActiveWords
                end
            end
        end
    end


Figure 7.2: Pseudo-code implementation of backward part of Forward-Backward Search.

The forward pass of the FBS proceeds as normal except that additional information is
gathered and stored for each time $t$. We denote the set of phoneme HMMs which are both
active at time $t$ and which also correspond to the end of words by $\Omega^t$. We record the set
$\Omega^t$, and also for each $w \in \Omega^t$ we record $\alpha(w, t)$, the score (or likelihood) of the final (or
exit) state.

After the backward pass has been modified to run in reverse, a small modification
is necessary to complete the implementation. At each $t$, the backward algorithm must
propagate scores from an active HMM ($w$) back through the grammar to another HMM
($w'$), at the same time calculating the HMM/grammar score for $w'$ for the next frame (actually
$t - 1$). In a normal beam search, this would be compared to the pruning threshold in order
to limit the number of theories in the search, but in the FBS we apply the additional tests
of inclusion in the set $\Omega^t$ and of the forward-backward score.

Figure 7.2 shows how a pseudo-code implementation of the backward pass of the FBS
might look. The only difference from the original algorithm is the second and third
lines of the “if” statement (indicated by boxed text) which is performed when deciding
whether to make an HMM active. Without the FBS, this “if” statement would be something
like:


    if (β(w', t) > PruningBeamwidth * MaxBeta)
    begin
        AddHmmToActiveWords
    end


7.1.5 Use of Forward-Backward Search in a Real-Time System


BBN’s HARC real-time continuous speech recognition system uses the Word-Dependent
N-best [85, 32] algorithm to present an ordered list of word hypotheses to the DELPHI natural
language (NL) processor. The N-best paradigm was chosen as the interface between the
two parts of the system both because the sentence accuracy of the optimal (Viterbi) answer
from the speech recognition part is not high enough to be usable and also because it
is difficult to use a complex natural language decision process at the frame level of the
recognition.

The N-best algorithm would, by itself, be impractical to use in a real-time system,
because it takes about 20 times real-time to compute. Not only do we use the FBS to
speed the N-Best algorithm to run in near real time, but we also use the result of the
forward (1-best) pass to provide the natural language system with the most likely answer
soon after the end of utterance is detected. Figure 7.3 illustrates how this works.

An MTU A/D converter digitizes the speech and feeds it to a dual C30 board which
is housed in a Sun4/330. This board performs analysis and vector quantization on the
speech and also detects the start and end points of the utterance. The vector quantized
speech is transferred to the main Sun4 via the VME bus. Here, the Sun4 performs the
forward pass of the FBS. Soon after the end of the utterance is detected, not only has the
forward pass computed the statistics for the forward-backward algorithm, but it has also
computed the Viterbi (1-best) transcription of the utterance. This first hypothesis is then
handed to the natural language system, which runs on a second Sun4 processor. If this first
hypothesis is accepted by the NL system, then a database enquiry is performed immediately.
However, in case the first hypothesis is not acceptable to the NL system (on syntactic or
semantic grounds), the backward pass of the FBS is performed at the same time as the NL
is processing the first answer. Thus, if the NL rejects the most likely answer, the second
and subsequent answers are available for the NL to attempt to understand.

Figure 7.3: Synchronization of Forward-Backward Search with NL processing in the
real-time speech understanding system.
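
The hand-off described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the HARC implementation: `backward_nbest` and `nl_accepts` are hypothetical callables standing in for the backward N-best pass and the NL system, and a thread stands in for the second processor.

```python
from concurrent.futures import ThreadPoolExecutor

def understand(one_best, backward_nbest, nl_accepts):
    """Return the first hypothesis the NL system accepts, or None.

    one_best: 1-best transcription from the forward pass.
    backward_nbest: hypothetical callable returning the ordered N-best list.
    nl_accepts: hypothetical callable, hypothesis -> bool.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the backward pass while the NL system looks at the 1-best answer.
        pending = pool.submit(backward_nbest)
        if nl_accepts(one_best):
            return one_best
        # The 1-best answer was rejected: walk down the remaining hypotheses.
        for hypothesis in pending.result():
            if hypothesis != one_best and nl_accepts(hypothesis):
                return hypothesis
    return None
```

In this sketch the executor still waits for the backward pass to finish before returning; a real system would presumably cancel or ignore it once the first answer is accepted.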


7.2  Time-Synchronous  Statistical  Language  Model  Search 

7.2.1 Basic Search Algorithm


We know that any language model that severely limits what sentences are legal cannot
be used in a real SLS because people will almost always violate the constraints of the
language model. Thus, a Word-Pair type language model will have a fixed high error
rate. The group at IBM has long been an advocate of statistical language models that
can reduce the entropy or perplexity of the language while still allowing all possible word
sequences with some probability. For most SLS domains where there is not a large amount
of training data available, it is most practical to use a statistical model of word classes rather
than individual words. We have circulated a so-called Class Grammar for the Resource
Management Domain [35]. The language model was simply constructed, having only
first-order statistics, and not distinguishing the probability of different words within a class. The
measured test set perplexity of this language model is about 100. While more powerful
"fair" models could be constructed, we felt that this model would predict the difficulty of a
somewhat larger task domain. The word error rate is typically twice that of the Word-Pair
(WP) grammar. One problem with this type of grammar is that the computation is quite
a bit larger than for the WP grammar, since all 1000 words can follow each word (rather
than an average of 60 as in the WP grammar).

During our work on statistical grammars in 1987 [78], we developed a technique that
would greatly reduce the computational cost for a time-synchronous search with a statistical
grammar. Normally, the number of grammar transitions in a bigram statistical grammar is
equal to the square of the number of words. (In a trigram grammar it is equal to the cube
of the number of words.) However, most of the bigram probabilities are, in fact, estimated
from a weighted version of the unigram probabilities, since the bigrams did not occur in the
training set for the language model. There are many different methods for combining the
bigram and unigram probabilities, including padding the bigram probabilities, a weighted
sum based on the number of occurrences of the first word in the bigram, and the backing-off
algorithm of Katz. In each of these cases, the grammar structure can be modified to take
into account the weighting scheme in such a way that only the bigrams that have actually
been observed need to be represented as explicit transitions in the grammar. This greatly
reduces the computation needed for recognition with a bigram grammar. It is also possible
to reduce the computation for a higher-order grammar in an analogous way. This approach
is described more fully below.

Figure 7.4 illustrates a fully-connected first-order statistical grammar. If the number
of classes is $C$, then the number of null-arcs connecting the nodes is $C^2$. However, since
the language models are rarely well estimated, most of the class pairs are never observed
in the training data. Therefore, most of these null-arc transition probabilities are estimated
indirectly. Two simple techniques that are commonly used are padding and interpolating
with a lower-order model. In padding we assume that we have seen every pair of words
or classes once before we start training. Thus we estimate $p(c_2|c_1)$ as


$$p(c_2|c_1) = \frac{N(c_1,c_2) + 1}{N(c_1) + C}$$


Figure 7.4: Fully Connected First-Order Statistical Grammar. Requires $C^2$ null arcs.

In interpolation we average the first-order probability with the zeroth-order probability
with a weight that depends on the number of occurrences of the first class.

$$p(c_2|c_1) = \lambda(c_1)\,\hat{p}(c_2|c_1) + [1 - \lambda(c_1)]\,\hat{p}(c_2)$$

where the weight $\lambda(c_1)$ depends on $N(c_1)$, the number of occurrences of the first class,
and

$$\hat{p}(c_2|c_1) = \frac{N(c_1,c_2)}{N(c_1)}, \qquad \hat{p}(c_2) = \frac{N(c_2)}{N(\text{all words})}$$

In either case, when the pair of classes has never occurred, the probability can be
represented much more simply. For the latter case of interpolated models, when
$N(c_1,c_2) = 0$ the expression simplifies to just

$$[1 - \lambda(c_1)]\,\hat{p}(c_2)$$

The first term, $1 - \lambda(c_1)$, depends only on the first class, while the second term, $\hat{p}(c_2)$,
depends only on the second class. We can represent all of these probabilities by adding a
zero-order state to the language model. Figure 7.5 illustrates this model. From each class
node we have a null transition to the zero-order state with a probability given by the first
term. Then, from the zero-order state to each of the following class nodes we have the
zero-order probability of that class.
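
The two estimators above can be written out as a short Python sketch. The weight function `lam` is a hypothetical placeholder: the text only says that the weight depends on the number of occurrences of the first class, so any concrete choice is an assumption. Counting classes from a single training sequence is likewise a simplification made for illustration.

```python
from collections import Counter

def make_estimators(class_sequence, num_classes, lam):
    """Build padded and interpolated estimates of p(c2 | c1).

    class_sequence: training data as a flat sequence of class labels.
    num_classes: C, the number of classes.
    lam: hypothetical weight function mapping N(c1) to a value in [0, 1].
    """
    seq = list(class_sequence)
    n_uni = Counter(seq)                  # N(c), used for the zeroth-order estimate
    n_first = Counter(seq[:-1])           # N(c1), occurrences as first of a pair
    n_pair = Counter(zip(seq, seq[1:]))   # N(c1, c2)
    total = len(seq)

    def padded(c2, c1):
        # Every pair is treated as if it had been seen once before training.
        return (n_pair[(c1, c2)] + 1) / (n_first[c1] + num_classes)

    def interpolated(c2, c1):
        weight = lam(n_first[c1])
        first_order = n_pair[(c1, c2)] / n_first[c1] if n_first[c1] else 0.0
        zeroth_order = n_uni[c2] / total
        return weight * first_order + (1.0 - weight) * zeroth_order

    return padded, interpolated
```

For an unseen pair the first-order term is zero, so `interpolated` returns exactly the product of a factor that depends only on `c1` and a factor that depends only on `c2`, which is the property the zero-order state exploits.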


Figure 7.5: Zero-state within first-order statistical grammar. All of the transitions estimated
from no data are modeled by transitions to and from the zero-state.

Now that the probabilities for all of the estimated transitions have been taken care of,
we only need the null transitions that have probabilities estimated from actual occurrences
of the pairs of classes, as shown in Figure 7.6. Assuming that, on average, there are $B$
different classes that were observed to follow each class, where $B \ll C$, the total number
of transitions is only $C(B + 2)$. For the 100-class grammar we find that $B = 14.8$, so we
have 1680 transitions instead of 10,000. This savings reduces both the computation and
storage associated with using a statistical grammar.


Figure 7.6: Sparsely Connected First-Order Statistical Grammar with zero-state requires
far fewer null arcs.
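
To make the transition count concrete, the sketch below builds the arc list of a sparsely connected grammar with a zero-state. The arc representation and the label `ZERO_STATE` are assumptions for illustration; only the counting argument comes from the text.

```python
ZERO_STATE = "<zero>"  # hypothetical label for the zero-order state

def build_sparse_arcs(classes, observed_pairs):
    """Return the null arcs of a first-order grammar with a zero-state.

    classes: iterable of class labels.
    observed_pairs: set of (c1, c2) pairs actually seen in training.
    """
    classes = list(classes)
    # Explicit arcs only for pairs that were observed.
    arcs = sorted(set(observed_pairs))
    # One arc from every class into the zero-state (weight 1 - lambda(c1)).
    arcs += [(c, ZERO_STATE) for c in classes]
    # One arc from the zero-state to every class (weight p_hat(c2)).
    arcs += [(ZERO_STATE, c) for c in classes]
    return arcs

def sparse_arc_count(num_classes, avg_followers):
    """Total arcs C * (B + 2) for C classes with B observed followers on average."""
    return num_classes * (avg_followers + 2)
```

With the figures quoted in the text, `sparse_arc_count(100, 14.8)` comes to about 1680, compared with 100 * 100 = 10,000 arcs for the fully connected grammar.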

It should be clear that this technique can easily be extended to a higher-order language
model. The unobserved second-order transitions would be removed and replaced with
transitions to a general first-order state for each word or class. From these we then have
first-order probabilities to each of the following words or classes. As we increase the order
of the language model, the percentage of transitions that are estimated only from
lower-order occurrences is expected to increase. Thus, the relative savings from using this
algorithm will increase.


7.2.2 Time-synchronous Forward Probability Search vs Viterbi


The search algorithm that is most commonly used is the Viterbi algorithm. This algorithm
has nice properties in that it can proceed in real time in a time-synchronous manner, is
quite amenable to the beam-search pruning algorithm [59], and is also relatively easy to
implement on a parallel processor. Another advantage is that it only requires compares
and adds (if we use log probabilities). Unfortunately, the Viterbi algorithm finds the most
likely sequence of states rather than the most likely sequence of words.

To compute the probability of any particular sequence of words correctly requires
that we add the probabilities of all possible state sequences for those words. This can
be done with the "forward pass" of the forward-backward training algorithm. The only
difference between the Viterbi scoring and the forward-pass computation is that we add
the probabilities of different theories coming to a state rather than taking the maximum.
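
The one-operation difference between the two scores can be seen in a small Python sketch. It uses plain probabilities on a toy HMM for readability; the function name and argument layout are illustrative assumptions, not the system's code.

```python
def hmm_score(obs_probs, trans, init, combine):
    """Score an observation sequence with an HMM.

    obs_probs[t][s]: probability of the observation at time t in state s.
    trans[r][s]: probability of moving from state r to state s.
    init[s]: initial state probability.
    combine: `sum` gives the forward probability, `max` gives the Viterbi score.
    """
    n = len(init)
    current = [init[s] * obs_probs[0][s] for s in range(n)]
    for t in range(1, len(obs_probs)):
        # The only difference between the two algorithms is `combine`:
        # add the scores of all theories entering a state, or keep the best one.
        current = [
            combine(current[r] * trans[r][s] for r in range(n)) * obs_probs[t][s]
            for s in range(n)
        ]
    return combine(current)
```

Calling `hmm_score(..., sum)` and `hmm_score(..., max)` on the same model shows that the forward score is never smaller than the Viterbi score, since the latter keeps only one of the state sequences that the former adds up.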

We presented a search algorithm in 1985 [86] that embodied most of this effect. Basically,
within words we add probabilities, while between words we take the maximum. It
was not proven at that time how much better, if any, this algorithm was than the simpler
Viterbi algorithm, and whether it was as good as the strictly correct algorithm that computes
the score of each hypothesis independently.

When we compared these three algorithms under several conditions, we found that there
was a consistent advantage for adding the probabilities within the word. For example, when
we use the class grammar, we find that the word error rate decreases from 8% to 6%.

Making sure that the time-synchronous forward search gives us the same performance
as the ideal forward score is somewhat more complicated. We must guarantee that we
have found the highest scoring sentence with the true forward probability score. One way
to find this is to use the exact N-Best algorithm [32]. Since the exact N-Best algorithm
separates the computation for any two different hypotheses, the scores that result are, in
fact, the correct forward probabilities, as long as we set N to a large enough value. A
second, much simpler way to verify the time-synchronous algorithm is to see if it ever
gets a wrong answer that scores worse than the correct answer. We ran a test in which all
incorrect answers were rescored individually using the forward probability. We compared
these scores to the forward probability for the correct answer. In no case (out of 300
sentences) did the time-synchronous forward search ever produce a wrong answer that, in
fact, scored worse than the correct answer.

The reason that this whole discussion about the Viterbi algorithm is relevant here is that
the Viterbi algorithm is faster than the forward search. Therefore, initially, we used the
integer Viterbi algorithm in the forward pass of the Forward-Backward Search. Since the
function of the forward pass is primarily to say which words are likely, it is not essential
that we get the best possible answer. The backward N-Best search was then done using
the correct algorithm that adds different state-sequence probabilities for the same word
sequence. However, more recently we have found that the floating point hardware of the
newer workstations is fast enough so that there is no significant speed advantage for the
Viterbi search at any point. Therefore, we now use the full probability search in both
directions.


7.3  Real-Time  Recognition  on  Commercially  Available  Hardware 


It has long been believed that real-time spoken language systems would require immense
amounts of computation that would only be available through the use of special-purpose
hardware or large-scale parallel computing systems. Unfortunately, in several attempts
to design or use parallel or special-purpose hardware at DARPA research sites, none has
achieved real time.

Instead, we believe that it is fundamentally more fruitful to reduce the computation
required for spoken language understanding. In particular, rather than gaining a one-time
factor of 10 from special-purpose or parallel hardware, we can often produce several orders
of magnitude in speed improvement by the invention of clever algorithms that approach
the same problem in a different way. (We have described several such algorithms earlier
in this report.) In addition, once the computation has been reduced, the algorithms can
be run on a broad class of commercially available workstations. This outcome is clearly
preferable to requiring special hardware.


7.3.1  Introduction 


One goal of the Spoken Language System (SLS) project is to demonstrate a real-time
interactive system that integrates speech recognition and natural language processing. We
believe that currently, the most practical and efficient way to integrate speech recognition
with natural language is using the N-Best paradigm, in which we find the most likely
whole-sentence hypotheses using speech recognition together with a simple language model, and
then filter and reorder these hypotheses with natural language. The accurate computation
of the N Best sentences requires significant amounts of computation.

Several sites have been developing hardware to accelerate the speech recognition process.
Two alternative techniques have been special-purpose high-speed VLSI hardware
[68] and several custom boards with more general-purpose processors operating in parallel
[17] [83]. With the rapid changes in commercial hardware in the past few years we felt
that if we could achieve real time with commercially available hardware we would have
progressed further. Therefore the goal of this effort was to find appropriate hardware and
to improve the algorithms such that the available hardware was sufficient for the problem.

Specifically,  our  goals  are: 

1. real-time processing with a short delay

2. reasonably inexpensive commercially available hardware

3. source code compatibility with our research programs

4. computation of the N Best sentences for large N

5. using a robust fully-connected statistical grammar

6. within practical memory limitations

7. with negligible loss of accuracy due to real-time limitations

8. extendable to larger vocabulary

The new algorithms used in this effort have been described above: specifically, a
method for efficient decoding with a statistical grammar, the Word-Dependent N-Best
search, and the Forward-Backward Search. In this section, we will describe the class of
hardware that we have used to achieve real-time recognition.

7.3.2 Hardware

The advantages of using commercially available hardware, if possible, are obvious:

1. No research development cost.

2.  Improvements  in  the  computer  hardware  industry  are  available  immediately. 

3.  Developments  can  be  shared  with  a  broader  community. 


We will review the sequence of considerations and decisions about the type of hardware
that we would use that took place during the course of this project. Briefly, we started out
using five separate processor systems for the sampling, signal processing, vector
quantization, recognition search, and understanding. As time went on, it became advantageous
to eliminate these boards, one by one, until we arrived at the point where we now perform
all of the computations directly on a single workstation. This evolution took place because


1. We continued to decrease the computation needed for each phase of the recognition
process.

2. Workstations continued to get faster and cheaper, and eventually included built-in
audio boards with A/D capability.


The detailed evolution is outlined below.

Starting in January of 1990, we considered several options for off-the-shelf processing
boards. It was already quite straightforward to perform signal processing in real time on
boards with signal processor chips. However, the speech recognition search requires a
large amount of computation together with several MB of fast readily accessible memory.
In the past there have not been commercially available boards or inexpensive computers
that meet these needs. However, this has been changing over the past couple of years. The
Motorola 88000 and Intel 860 chips are now in boards with substantial amounts of random
access memory. Most chips now come with C compilers, which means that the bulk of
development programs can be transferred directly. If needed, computationally intensive
inner loops can be hand coded.

After considering several choices we chose boards based on the Intel 860 processor.
The Intel 860 processor has a peak speed of 80 MFLOPS. At the time we considered this
processor, most C programs that we had run on both of these machines ran about five times
faster than on a SUN 4/280.

Figure 7.7 illustrates the hardware configuration that we envisioned. The host was a
SUN 4/330. The microphone was connected to an external preamp and A/D converter which
connects directly to the serial port of the Sky Challenger. The Sky Challenger with dual
TMS320C30s was used for signal processing and vector quantization (VQ). The SkyBolt
was to be used for the speech recognition N-Best search. The boards would communicate
with the host and each other through the VME bus, making high speed data transfers easy.
However, currently the data transfer rate between the boards is very low. The SUN 4 would
control the overall system and also contain the natural language understanding system and
the application back end.

The design given above met our requirements for speed and flexibility. We would
use all four processors during most of the computation. When speech has started the
MTU A/D would filter and digitize the signal and feed it, one sample at a time, to the
C30 board, which would compute the signal processing and VQ in real time. The SUN
4 would accumulate the speech for possible long-term storage or playback. Meanwhile,
the Intel 860 would compute the forward pass of the forward-backward search. When
the end of the utterance was detected, the SUN would give the 1-Best answer to the
natural language understanding system running on another workstation for parsing and
interpretation. Meanwhile the Intel 860 would search backwards for the remainder of the
N Best sentence hypotheses. These should be completed in about the same time that the
NL system requires to parse the first answer. Then, the NL system can parse down the list
of alternative sentences until an acceptable sentence is found.

Figure 7.7: Real-Time Hardware Configuration. The Sky Challenger Dual C30 board and
the Intel 860 board plug directly into the VME bus of the SUN 4.

The computation required for parsing each sentence hypothesis is about 1/2 second.
The delay for the N-Best search is about half the duration of the sentence.

However, shortly after considering this hardware option, several developments changed our
minds as to the most efficient way to proceed.


1. As we sped up the implementation, we noticed that the SUN 4 version got faster,
but the Intel 860 version did not. Presumably this was due to better usage of the
cache in the SUN 4. The eventual speed advantage over a SUN 4/280 was now only
a little over 3.

2. The C compilers for the boards always had bugs, and were never fully integrated
into the machine.

3. With each new workstation, the bus architecture changed, making it necessary to
search for a new accelerator board.

4. New workstations came out that were much faster and much cheaper. The SUN
4/330 was already 1.6 times faster than the SUN 4/280, and when the Sparc II came
out it was about 3 times faster than the SUN 4/280. There are other workstations like
the SGI 4D/35 and the HP 750 that are faster still. Therefore, the speed advantage of
the Intel 860 disappeared completely. In addition, the cost of these new workstations
decreased to the point where they were considerably less expensive than the separate
processor boards.


Therefore, we quickly abandoned the idea of using a separate board for the recognition
search, in favor of simply using the workstation itself. We continued to use the Sky
Challenger for front end signal processing for another year, since this replaced a significant
amount of computation that could be done easily on a separate board.


7.3.4 Software-Only System


Even  the  boards  used  for  signal  processing  have  two  significant  disadvantages: 


1. They often cost as much as the workstation they are plugged into.

2. The interface between each board and workstation is complicated, and always
different for each combination of workstation and board.

In a separate, internally-funded BBN effort, we have embarked on a productization
of our BYBLOS speech recognition technology. One objective of this effort is to make
speech recognition available to a broad base of users at an affordable cost. To that end, we
have eliminated the disadvantages given above by developing signal processing algorithms
that are able to operate in real time on COTS workstations without requiring additional
add-on hardware and without decreasing recognition speed and accuracy. An additional
advantage is that we are able to benefit from the improvements in workstation price and
performance, with very minimal porting effort. The result of this productization effort has
been the development of a real-time system, the BBN RUBY™ speech recognition system.

The RUBY system is based entirely on a single workstation. It used the SGI 4D/35,
which has a built-in high quality A/D. The computation needed for the signal processing
was reduced to the point where it required only a small part of the available computational
power of the workstation. This eliminated the need for a separate hardware board for signal
processing.

Our RUBY system was demonstrated at the DARPA Spoken Language Workshop at
Arden House in February of 1992. Two applications were demonstrated: recognition of
aircraft identification spoken by an air traffic controller and the Airline Travel Information
System (ATIS), which included both speech recognition and natural language understanding
on the same machine.


7.3.5 Speed and Accuracy

When we started our search optimization effort in January 1990, our unoptimized
time-synchronous forward search algorithm took about 30 times real time to recognize with the
word-pair grammar with the beamwidth set to avoid pruning errors. The class grammar
required 10 times more computation. The exact N-Best algorithm required about 3,000
times real time to find the best 20 answers. When we required the best 100 answers, the
program required about 10,000 times real time. Since then we have implemented several
search algorithms, optimized the code, and used newer, faster workstations. The forward
pass decoder now runs in real time and the N-Best pass now runs in about 1/4 real time,
with essentially no loss in accuracy. Below we give each of these methods along with the
factor of speed gained.

Method                            Speedup factor

Statistical grammar algorithm     5

Word-Dependent N-Best             5

Forward-Backward Search           40

Code Optimization                 8

Newer Workstations                5


Total reduction in computation    40,000

As can be seen, the three algorithmic changes accounted for a factor of 1,000 of this
amount, while the code optimization and faster processor accounted for a factor of 40.
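
The arithmetic behind these totals is simply a product of the individual factors, as the short Python check below illustrates. The first factor is read here as 5, an assumption that is consistent with the stated algorithmic total of 1,000; the dictionary and function names are illustrative.

```python
from math import prod

# Speedup factors as listed in the table above.
SPEEDUP_FACTORS = {
    "Statistical grammar algorithm": 5,   # read as 5 (assumed from the stated totals)
    "Word-Dependent N-Best": 5,
    "Forward-Backward Search": 40,
    "Code Optimization": 8,
    "Newer Workstations": 5,
}

ALGORITHMIC = (
    "Statistical grammar algorithm",
    "Word-Dependent N-Best",
    "Forward-Backward Search",
)

def summarize(factors, algorithmic_names):
    """Return (algorithmic speedup, other speedup, total speedup)."""
    algorithmic = prod(factors[name] for name in algorithmic_names)
    other = prod(v for name, v in factors.items() if name not in algorithmic_names)
    return algorithmic, other, algorithmic * other
```

The factors multiply rather than add because each one reduces the computation that remains after the others have been applied.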


Accuracy 


It is relatively easy to achieve real time if we relax our goals for accuracy. For example,
we could simply reduce the pruning beamwidth in the beam search and we know that
the program speeds up tremendously. However, if we reduce the beamwidth too much,
we begin to incur search errors. That is, the answer that we find is not, in fact, the
highest scoring answer. There are also several algorithms that we could use that require
less computation but increase the error rate. While some tradeoffs are reasonable, it is
important that any discussion of real-time computation be accompanied by a statement of
the accuracy relative to the best possible conditions.
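
A toy Python example makes the notion of a search error concrete. It prunes competing hypotheses frame by frame against the best current score; the data layout and the parameter name `threshold_ratio` are assumptions for illustration. A ratio closer to 1 corresponds to a narrower beam.

```python
def beam_search(hyp_frame_probs, threshold_ratio):
    """Return the surviving hypothesis with the best final score.

    hyp_frame_probs: dict mapping a hypothesis name to its per-frame probabilities.
    threshold_ratio: a hypothesis is pruned when its running score falls below
        threshold_ratio times the best running score (hypothetical parameter).
    """
    active = {name: 1.0 for name in hyp_frame_probs}
    n_frames = len(next(iter(hyp_frame_probs.values())))
    for t in range(n_frames):
        # Extend every active hypothesis by one frame.
        active = {h: s * hyp_frame_probs[h][t] for h, s in active.items()}
        best = max(active.values())
        # Keep only hypotheses inside the beam.
        active = {h: s for h, s in active.items() if s >= threshold_ratio * best}
    return max(active, key=active.get)

# Hypothesis B starts poorly but has the higher overall score.
EXAMPLE = {"A": [0.9, 0.1, 0.1], "B": [0.2, 0.9, 0.9]}
```

With `EXAMPLE`, a wide beam (`threshold_ratio=0.1`) keeps B alive through the first frame and returns it, while a narrow beam (`threshold_ratio=0.5`) prunes B early and returns A, which is exactly the kind of search error described above.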

In our most recent real-time demonstration in the ATIS domain, we measured the
increase in error rate under real-time operation. We found that the overall error rate
increased by only 7% for the answerable queries. Thus, we have shown that the loss due
to real-time processing is almost insignificant. We believe that this marks the first time that
real-time recognition for 1,000 words in continuous speech could be accomplished with so
little loss relative to the very best performance possible.

Overall, we are confident that the approach of modifying algorithms and using
general-purpose workstations for real-time operation will scale up to larger vocabularies
and more complex grammars.


7.3.6 Conclusion


We have achieved real-time recognition of the N-Best sentences on a commercially available
workstation. Most of the increase in speed came from algorithm modifications rather than
from fast hardware or low-level coding enhancements, although the latter improvements
were substantial and necessary. All the code is written in C so there is no machine
dependence. All told we sped up the N-Best computations by a factor of 40,000 with a
combination of algorithms, code optimization, and faster hardware.


7.4  Demonstrations 


During the course of this contract, we have developed several demonstrations of speech
recognition and understanding. Each one was a breakthrough at the time. Briefly, the
major demonstrations were:


1. August 1990: Near-real-time speech recognition of Resource Management domain.
This demonstration was faster than the only other demonstration at the time, which
used a special-purpose three-processor board for the search.

2. January 1991: Real-time spoken language system. We connected the real-time speech
recognition to the natural language system running a database query system in the
DART TRANSCOM Military Transportation Logistical Planning Domain.

3. February 1992: Real-time spoken language on a single workstation. We demonstrated
a full spoken language system in the ATIS domain. The demonstration used
no extra hardware besides a single SGI workstation. Even so, it was two to four
times faster than any of the other demonstrations in this domain, even though they
all used additional hardware for the front end signal processing. In addition, the
accuracy of this demonstration significantly exceeded the others, partially because of
the capability of using a trigram grammar in real time.

4. February 1992: Demonstration of a real air traffic controller speech recognition
application. The application was significant in its extremely high accuracy, its
capability for rejecting out-of-vocabulary items, and its ability to recognize and output
a subgrammar embedded in a sentence within 300 ms of the end of speaking the
subgrammar phrase.


These demonstrations are described in more detail below.


7.4.1  August  1990:  Real-Time  Speech  Recognition 

In August of 1990, we demonstrated the system described in June of 1990 at the Hidden
Valley Workshop. In particular, we used a SUN 4/330 workstation, with a Sky Challenger
signal processing board for the front end signal processing, and an MTU A/D board for
capturing the signal. All of the search computation was performed on the SUN workstation.
We demonstrated near real-time recognition in the Resource Management Domain. By
"near real-time", we mean that the first choice answer for the sentence was typed within
a very short delay of when the speaker stopped speaking. Then, the N-Best sentence
hypotheses were typed within a second or two after that. The demonstration of
speaker-independent recognition had been trained on only a small number of speakers: in particular,
8 male and 7 female speakers. (We had previously demonstrated at the June '90 workshop
that we had obtained approximately the same accuracy by training with a large amount of
speech from a small number of speakers as with a small amount of speech from a large
number of speakers.) We believe that this was the first real-time demonstration on the RM
domain.

In all, we sped up the search by a factor of 20,000 with a combination of hardware
and software improvements. A real-time recognition system also required implementing a
complete front end that would filter, sample, analyze, and vector quantize the speech, and
pass the results on to the recognition search, all in real time. We used a programmable
MTU filter and A/D converter to do the basic speech sampling. We used a Sky Challenger
with two TMS320C30 processors to control the MTU and to perform the signal processing
(Mel-Frequency Cepstral Analysis) and vector quantization. The Sky Challenger was placed
on the VME bus of a SUN 4/330 workstation. Our plan was to use the SkyBolt signal
processing board for the recognition search. This would also be connected directly to the
same VME bus, and the vector quantization output from the C30 processors would be fed
directly into the Intel 860s on the SkyBolt. However, there were problems with the C
compiler on the SkyBolt that had a large enough memory to run the recognition program.
Therefore, for the time being we ran the recognition directly on the SUN 4 workstation.
We found that since we had sped up the recognition so much, the recognition was able to
run in almost real time on the SUN 4 workstation without the need for further accelerator
boards.

We implemented a demonstration of speech recognition in which the speaker says a
sentence when prompted to do so. While the speaker is speaking, the signal processing
and vector quantization are performed in real time. The forward pass recognition
search is also proceeding, as fast as possible. Shortly after the speaker stops speaking, the
system has finished the recognition of the most likely sentence, and prints the answer onto
the screen. It plays the speech back to the user so that s/he can verify that the answer
is correct. (In a spoken language system, this answer will be fed to the natural language
component for understanding.) Meanwhile, the system performs a backward search for
the N-Best sentences using the forward pass to speed up the search tremendously. The
top N sentences are then displayed on the screen, with their corresponding acoustic and
statistical language model scores. The backward pass is fast enough so that it is always
finished long before the sentence has been replayed to the speaker. (In the spoken language
system, these N answers will be made available to the natural language component, in case
the first choice sentence did not parse or did not make sense.)

We created a speaker-independent speech model using the speech of eight male speakers.
The training data of each speaker was first quantized using a male-dependent codebook,
and then a speaker-dependent model was estimated for each of the speakers. Finally, the
eight models were averaged to create a male-dependent speech model. We repeated the
same steps for the 7 females available. The demonstration used a statistical first-order class
grammar. We defined 548 word classes. Most words were in their own class, but words
that are completely interchangeable, like names of ships and ports, and the digits and days
of the week, are grouped together into classes. The perplexity of the resulting grammar
was 22. However, all sequences of words are possible, since we assume that any class can
be followed by any other with some small probability.
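
The idea of a first-order class grammar in which every class can follow every other
with some small probability can be illustrated with a minimal Python sketch. This is
not the BYBLOS implementation: the function name, the interpolation weight eps, and
the choice of mixing with a uniform distribution are all assumptions made for
illustration only.

```python
from collections import Counter


def train_class_bigram(sentences, word_to_class, eps=0.01):
    """Return P(next_class | prev_class) with every transition kept nonzero.

    eps is a hypothetical smoothing weight, not a value from the report.
    """
    classes = sorted(set(word_to_class.values()))
    counts = {c: Counter() for c in classes}
    for sentence in sentences:
        tags = [word_to_class[w] for w in sentence]
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1

    uniform = 1.0 / len(classes)
    probs = {}
    for prev in classes:
        total = sum(counts[prev].values())
        probs[prev] = {}
        for nxt in classes:
            # Relative frequency, or uniform when the class was never seen.
            rel = counts[prev][nxt] / total if total else uniform
            # Mix in a little uniform mass so no class sequence is impossible.
            probs[prev][nxt] = (1.0 - eps) * rel + eps * uniform
    return probs
```

With this kind of smoothing, an unseen class sequence receives a low but nonzero
probability, so the recognizer is never forced to reject a word sequence outright.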


January  1991:  Real-Time  Spoken  Language  System 


For the next demonstration, in January of 1991, we connected the real-time BYBLOS
N-Best recognizer described above to a natural language understanding component and
demonstrated a complete real-time HARC spoken language system. The speech and NL
processes communicate through an Ethernet connection, thus enabling our real-time SLS to
run on different types of machines, if desired. Since the communication data rate between
the speech and NL components is very low, the Ethernet connection does not cause any
bottleneck in the integration.

The way the system worked was as follows: As the speaker says an utterance, the
speech component first performs the forward pass of the Forward-Backward recognition
search. As soon as the speaker has stopped speaking, the most likely sentence hypothesis is
sent to the NL component, which begins to parse it. Simultaneously, the speech component
performs the backward pass to find the N-Best hypotheses (this backward pass is completed
in a fraction of real time). The N-Best hypotheses are then sent to the NL component, which
goes through each of the hypotheses until it finds one that is linguistically meaningful. The
highest-scoring linguistically meaningful hypothesis is taken as what was said and is used
to perform the database query or command. Typically the response to a query or command
appears on the screen within only a few seconds of the end of an utterance.
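
The selection step described above, in which the NL component walks down the N-Best
list until it finds a linguistically meaningful hypothesis, can be sketched in a few
lines of Python. The function name and the is_meaningful callback are hypothetical
stand-ins for the natural language component; the sketch assumes that a higher score
means a more likely hypothesis.

```python
def choose_hypothesis(nbest, is_meaningful):
    """Return the highest-scoring hypothesis accepted by the NL component.

    nbest: list of (sentence, score) pairs; higher score = more likely (assumed).
    is_meaningful: callable standing in for parsing and interpretation.
    """
    ranked = sorted(nbest, key=lambda h: h[1], reverse=True)
    for sentence, _score in ranked:
        if is_meaningful(sentence):
            return sentence
    return None  # no hypothesis could be interpreted
```

Because the list is scanned in score order, the first accepted hypothesis is also the
highest-scoring meaningful one, and the search stops as soon as it is found.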

One other interesting feature of the system is that the user can either speak or type a
query at any point.


7.4.2 February 1992: Real-Time SLS using BBN/RUBY


At the workshop, we demonstrated two example systems that employ the BBN RUBY
speech recognition system. Both demonstrations run on Silicon Graphics workstations
(Personal IRIS 4D/35 and Indigo), which contain a built-in programmable A/D-D/A. The
signal processing and vector quantization, which run in a separate process from the
recognition search, communicate with the recognition search via network sockets. We have
reduced the computation required for this front end processing to the point where it
leaves enough of the CPU free to perform the more
expensive search in real time. Since accuracy is our primary concern, we have verified
that this signal processing results in the same accuracy as our previous signal processing
software.


Real-Time  ATIS  System 


The ATIS demonstration integrated BBN’s DELPHI natural language understanding system
with the RUBY speech recognition component. RUBY is used as a black box, controlled
entirely through an application programmer’s interface (API). The natural language
component is our current research system, which runs as a separate process. Both processes run
on the same processor, although not at the same time. The NL processing is performed
strictly after the speech recognition, since competing for the same processor could not make
it faster.

The speech recognition component performs three separate steps. First, it uses a
forward Viterbi-like computation to find the 1-Best speech answer in real time as the user is
speaking. Immediately after the user stops speaking, it displays the 1-Best answer. Then, it
performs a backwards N-Best pass to find the N-Best hypotheses. Finally, we rescore each
of the text hypotheses using a higher-order n-gram statistical class grammar and reorder
the hypotheses accordingly. In this application, we use a trigram model; this rescoring
computation requires very minimal processing. (Note that at this time we have omitted
the acoustic rescoring stage in which we could rescore with between-word triphone models
and semi-continuous HMM models.) After that, the N-Best answers are sent to DELPHI,
which searches the N-Best answers for an interpretable sentence and then finds the answer
to the question.
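
The rescoring step can be illustrated with a small Python sketch that adds a trigram
log probability to each hypothesis score and reorders the list. It is a simplified
illustration, not the RUBY implementation: it uses a word trigram stored in a plain
dictionary rather than a statistical class grammar, and the names trigram_logprob,
rescore_nbest, lm_weight, and floor are assumptions introduced here.

```python
import math


def trigram_logprob(words, trigram_probs, floor=1e-6):
    """Sum of log P(w_i | w_{i-2}, w_{i-1}); unseen trigrams get a small floor."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        key = (padded[i - 2], padded[i - 1], padded[i])
        total += math.log(trigram_probs.get(key, floor))
    return total


def rescore_nbest(nbest, trigram_probs, lm_weight=1.0):
    """nbest: list of (word_list, acoustic_log_score). Returns a reordered list."""
    rescored = [
        (words, acoustic + lm_weight * trigram_logprob(words, trigram_probs))
        for words, acoustic in nbest
    ]
    return sorted(rescored, key=lambda h: h[1], reverse=True)
```

Since only N short word strings are rescored, the cost is a handful of table lookups
per hypothesis, which is consistent with the observation that this stage requires
very little processing.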

Normally, the system displays the 1-Best answer within a half second of detection of
the end of speech. This is fast enough that it feels instantaneous. (This speed is in marked
contrast with the other near real-time demonstrations shown at the workshop, which all
required from two to five times real time, resulting in a delay that was at least equal to the
length of the utterance, usually several seconds.) Next it performs the N-Best recognition,
and then interprets the answer. The N-Best recognition usually runs in less than 1 second,
since it is sped up immensely by the forward pass. The rescoring of the N-Best hypotheses
with the trigram grammar requires essentially no time. The time required for the interpretation
depends on how many of the speech answers must be considered, and on how complicated
a retrieval results. In most cases, this phase requires only another second or two.

To operate the demonstration system, the user clicks on the “Push to Talk” window
at the top of the screen. The status will change from “ready” to “listening”. As soon as
the user begins speaking, the status will change to “beginning of speech”. When (s)he
stops speaking, it will change to “end of speech”. The system briefly displays its status
as “First-Best” and “N-Best” while it completes these phases of the recognition. Finally,
the system will “Interpret” the query, which includes all parsing, semantic interpretation,
discourse modeling, and data retrieval from the actual database.

The answer displayed in the speech window first contains the answer from the 1-Best
pass, then the top choice of the N-Best, and finally the sentence chosen by DELPHI. The
N-Best hypotheses are displayed at the bottom of the screen for information only. Then,
the answer to the query is displayed under the recognized sentence. If the answer to be
displayed is larger than will fit in the window, it can be scrolled. A history of the previous
four sentences is shown in a window that can scroll all the way back through the previous
questions. If the user wishes to review any of the previous answers in more detail, they
may click on the arrow to the right of the question, which brings back a copy of the
question and answer in a separate window that may be placed and sized as desired, and
then used for reference as long as needed.

To the right of the main display, we also display the discourse state, which consists of
the set of constraints that were used to answer the query. In this way, the user can verify
how much of the previous context was actually used to answer the question. As each
successive query is interpreted, the system may add new constraints, modify old ones, or
completely reset the context. The user may also reset the discourse state by speaking any
of the commands “BEGIN SCENARIO”, “END SCENARIO”, or “NEW SCENARIO”.


Real-Time Speech Recognition for Air-Traffic Control Applications


At the February 1992 DARPA Workshop, we also demonstrated the BBN RUBY system
configured for air-traffic control (ATC) applications. The demonstration was developed
under a separate, government-funded effort. This system is notable for its high speed,
accuracy, robustness, and reliability, all necessary qualities for ATC applications requiring
human-machine interaction. Such applications include training systems, where the trainee
controller interacts with a simulated world, and operational environments, where a
controller’s interaction with a pilot could automatically generate a database retrieval request
for flight plan information.

In this demonstration, the system extracts the aircraft flight identification from an
utterance as soon as that information has been spoken. For example, if the controller says, “Delta
three fifty seven descend and maintain 2000”, the flight information could be captured for
display or other uses by the time the controller has said the first syllable of “descend”. To
achieve this immediate response, the speech recognition detects when the controller has
completed the flight identifier and is speaking the rest of the utterance. This requires a
different process than is usually used for speech recognition. Normally we wait until the
end of the sentence to determine the most likely word string for the complete utterance.
For this application, the system stops the recognition process as soon as it determines that
it is most likely to have the complete flight information.

Another unique capability that is demonstrated here is the capability to reject the flight
ID if it is not in a specific closed set. Again, this is done by explicitly modeling the
likelihood that the user has spoken a flight ID other than the set that is expected at any
given time.


Chapter  8 


Detecting  and  Adding  New  Words 


Another major area of work during this contract was related to the use of new words, that
is, words outside the initial vocabulary of the system. In a large vocabulary system, the
user will invariably use words that are not in the current vocabulary. As described below,
the usual system behavior will be quite confusing and frustrating for the user. Therefore,
it would be quite useful if the system could alert the user when a new word was spoken,
even though it could not recognize the new word. Once the need for a new word has been
identified, the user must add this word to the system without stopping to collect additional
speech training data or restarting the whole system. The issue is to define all the necessary
information for the new word. This includes its orthographic spelling, its phonetic spelling,
and where it fits in the language model.

In this effort we developed what we believe to be the very first method for detecting
new words. We also developed a novel method for adding a new word to the system’s
vocabulary. Details are given below.


8.1  Detecting  New  Words 

8.1.1 The New-Word Problem


Current continuous speech recognition systems are designed to recognize words within the
vocabulary of the system. When a new word is spoken, the systems recognize other words
that are in the vocabulary in place of the new word. When this happens, the user does
not know that the problem is that one of the words spoken is not in the vocabulary. He
assumes that the system simply misrecognized the word, and therefore he says the sentence
again and again. Current systems do not tell the user what the problem is, which could be
very frustrating.

Adding the ability to detect new words automatically is desirable and improves the
performance of the system. Once a new word is detected, it is possible to add the word
to the vocabulary with some extra information from the user, such as repeating the word
within a carrier phrase and typing in the spelling of the word.


8.1.2  Approach 


An obvious zero-order solution to the problem of detecting new words is to apply some
rejection threshold on the word score. If the score reaches a level higher than the threshold
then a new word is detected. However, when we examined the scores of words in a
sentence, we found that the score of correct words varies widely, making it impossible to
tell whether a word is correct or not. Therefore, this approach for detecting new words did
not work well.

Our proposed solution is to develop an explicit model of new words that will be detected
whenever a new word occurs. The word model should be general enough to represent any
new word. It should score better than other words in the vocabulary where there is a
new word; it must always score worse on words in the vocabulary than the model of that
word. Given the above assumptions, we tried four acoustic models of new words, which
are described below.


Proposed Models for a New Word


All new-word models consist of sequences of phonemes. Each phoneme is represented by
a 3-state Hidden Markov Model (HMM). The states are connected from left to right with
“self-loops” on each state. There are transition probabilities on these connections. Also,
associated with each state is a spectral distribution for the VQ clusters (256 clusters).

The first new-word model we tried consisted of four phonemes (see Figure 8.1). It has
5 states and 4 identical phonemes with an adjustable flat spectral distribution. From now
on we’ll refer to this model as M1.

Figure 8.1: Flat New-Word Model (M1). X: phoneme HMM with 3 flat spectral
distributions.

The second new-word model that we tried was a model that allows for any sequence
of phonemes at least two phonemes long (see Figure 8.2a). The model has 3 states, with
all phonemes in parallel from the first state to the second state, all phonemes in parallel
from the second state to the third state, and all phonemes in parallel looping on the second
state. The model has 3N phoneme arcs, where N is the number of phonemes used in
the system (N=53 in our system). All phonemes are represented with context-independent
phoneme models. Note that this is in contrast to the normal vocabulary of the system, which
uses context-dependent phoneme models. The context-independent phoneme models in the
new-word model are trained on the same data as the system vocabulary. We’ll refer to this
model as M2.
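
The topology of M2 can be made concrete with a short Python sketch that lists its
phoneme arcs and checks which phoneme sequence lengths the model accepts. This is an
illustration of the state graph only, with no acoustic modeling; the function names
and the representation of arcs as (from_state, to_state, phoneme) tuples are
assumptions, and the example takes N = 53 with placeholder phoneme labels.

```python
def build_m2(phonemes):
    """Phoneme arcs of the 3-state model; states are numbered 0, 1, 2."""
    arcs = []
    for p in phonemes:
        arcs.append((0, 1, p))  # first phoneme: state 0 -> state 1
        arcs.append((1, 1, p))  # any number of extra phonemes: loop on state 1
        arcs.append((1, 2, p))  # last phoneme: state 1 -> state 2
    return arcs


def accepts_length(arcs, n_phonemes, start=0, final=2):
    """True if some path of exactly n_phonemes arcs leads from start to final."""
    states = {start}
    for _ in range(n_phonemes):
        states = {dst for (src, dst, _p) in arcs if src in states}
    return final in states


phonemes = [f"ph{i}" for i in range(53)]  # placeholder labels, N = 53
m2_arcs = build_m2(phonemes)
print(len(m2_arcs))                       # 3N arcs
print(accepts_length(m2_arcs, 1), accepts_length(m2_arcs, 2))
```

The arc count comes out to 3N, and the shortest accepted sequence is two phonemes
long, matching the description of the model.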

The third new-word model we used was similar to the second model (see Figure 8.2b).
It has 5 states with a minimum of 4 phonemes. The model has 5N phoneme arcs. All
phoneme models are context independent. We’ll refer to this model as M3.

The fourth word model was a diphone word model (see Figure 8.3). It consisted of
models of the phonemes in the context of the previous phoneme (diphones). It allows for
any sequence of diphones with a minimum of two diphones. This model has N+2 states,
where N is the number of phonemes used in the system and the 2 accounts for the beginning
and end word boundaries. It has N^2 + 2N left-context phoneme arcs. We’ll refer to this
model as M4.


Figure 8.2: 3-State and 5-State New-Word Models (M2 and M3).


Grammar

We used a first-order statistical class grammar. A statistical class grammar consists of
class nodes, word arcs and arcs between classes. Each class node has a number of word
arcs emerging from it. Word arcs in the same class have equal probabilities. There are
transition probabilities between the classes that depend on the training given to the class
grammar.

New words are more likely to appear in open classes than closed classes. Open classes
are the classes that accept new words (e.g., ship names, port names) as opposed to closed
classes that do not accept new words (e.g., months, week-days, digits). We created a
separate new-word model for each open class. Also, it is easy to add the open class words
to statistical class grammars and to Natural Language syntax and semantics.

Figure 8.3: Diphone New-Word Model (M4).

8.1.3 Experiments and Results

The experiments we describe here use the DARPA 1000-word Resource Management
Database for continuous speech recognition and BYBLOS, the BBN continuous speech
recognition system [31]. The database is limited to 1000 words and does not include data
to test for new words. Therefore, we simulated new words in the system simply by
removing words from the 1000-word vocabulary that occur in the test sentences.

In the following we give results for experiments that used the four word models for a
new word. The experiments were run on 7 speakers from the speaker-dependent portion of
the database (BEF, CMR, DMS, DTB, DTD, JWS and PGH), 25 test sentences per speaker.
We created a statistical class grammar from the remaining words in the vocabulary. We
varied the perplexity of the statistical class grammars simply by changing the number of
training sentences. A bias for new words was implemented in the case of the flat new-word
model (M1); a bias against new words was implemented to reduce the false alarm rate in
the 3-state, 5-state and diphone models (M2, M3 and M4). The bias is a scaling factor
which is multiplied by the new word arc probability from an open class, reducing the
probability of selecting the new word from that class with respect to the rest of the words
in the class.
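
The effect of the bias on the arcs leaving an open class can be sketched as follows.
The sketch assumes that all word arcs in a class start with equal weight, that the
new-word arc weight is multiplied by the bias, and that the arc probabilities are then
renormalized to sum to one. The renormalization and the names class_arc_probs and
new_word_label are assumptions, since the text does not say how the scaled
probability is normalized.

```python
def class_arc_probs(words, bias, new_word_label="NEW-WORD"):
    """Arc probabilities for one open class, including its new-word arc.

    bias < 1 penalizes the new-word arc; bias > 1 favors it.
    Renormalizing after scaling is an assumption made for this sketch.
    """
    weights = {w: 1.0 for w in words}   # known words share equal weight
    weights[new_word_label] = bias      # scaled weight for the new-word arc
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


ship_names = ["kirk", "fox", "peoria"]  # illustrative class members
print(class_arc_probs(ship_names, bias=0.1, new_word_label="NEW-SHIP-NAME"))
```

With a bias of 1 the new-word arc is as likely as any other word in the class; a bias
below 1 makes the recognizer prefer known words unless the acoustic evidence for the
new-word model is clearly stronger.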

We ran experiments to detect new words from the classes ship name (e.g., Downes)
and ship name possessive (e.g., Downes’s). In these experiments we removed from the
vocabulary 45 ship names and their corresponding possessives (90 words). The 175 test
sentences included 59 occurrences of new ship names. Also, we ran an experiment on
detecting new words from the class port name. In this experiment we removed 11 port
names from the original vocabulary and the test sentences had one occurrence of each.
Then, we ran experiments on detecting new words from 7 different classes at the same
time. In these experiments we removed a total of 41 words from the original vocabulary
from the classes ship name (10 words), ship name possessive (10 words), port name (8
words), water name (5 words), capability (4 words), land name (4 words). We included
the class track name but did not remove words from this class because the test sentences
did not have any words from this class. The 175 test sentences have 62 occurrences of
new words.

Using the model M1 we ran an experiment to detect new words from the classes ship
name and ship name possessive without distinction between the two classes. The results
showed no detections or false alarms. When we tuned the results with a bias for the new-
word model, the detection rate was 67% and the false alarm rate was 51%. This model is
not useful since it has a very high false alarm rate.

We ran experiments using the model M2 and the results are shown in Table 8.1. A bias
against the new-word models was used to reduce the false alarm rate. The columns in the
table are described below, with examples as illustrations.


• classes: the classes of new words allowed.

•  perp:  the  perplexity  of  the  grammar. 

• cor: strict or exact detection rate as a percentage of the number of new words. The
removed word was exactly replaced by the new-word model of the same class. In
the following example, the lines marked REF are what was actually spoken; the lines
marked HYP are what was recognized by the system. The sentence appears on more
than one line for clarity. The words LAMPS, of class capability, and MOZAMBIQUE,
of class water name, were removed from the vocabulary. The system recognized
NEW-CAPABILITY and NEW-WATER-NAME, respectively, exactly in their
places.

SENTENCE (1237)

REF: how many LAMPS
HYP: how many NEW-CAPABILITY

REF: cruisers are in
HYP: cruisers are in

REF: MOZAMBIQUE     channel
HYP: NEW-WATER-NAME channel

• cls: close call or close detection rate. That is, the new word was detected but there
was an insertion or deletion in its vicinity. In the following example, the new word
HAWKBILL, of class ship name, was recognized correctly as NEW-SHIP-NAME, but
the word DUE was deleted.

SENTENCE (0464)

REF: when+s HAWKBILL      DUE in
HYP: when+s NEW-SHIP-NAME *** in

REF: port
HYP: port

• sw: switch between classes, i.e., the new word was detected, but was assigned to
the wrong class. In the following example the word PEORIA is a ship name and the
system recognized it as a ship name possessive.

SENTENCE (1006)

REF: when was PEORIA          last
HYP: when was NEW-SHIP-NAME+S last

REF: in the atlantic ocean
HYP: in the atlantic ocean

• det: total detection rate, the sum of cor, cls and sw. This is the rate of correctly
detecting the existence of a new word in the vicinity of its occurrence. (While we would
like the system to detect the exact location and the class of the new words, it is also
useful to simply detect that a new word has occurred.)

• fal: false alarm rate, the number of false alarms as a percentage of the total number of
test sentences. A false alarm is a new word detected where there was no new word
in that part of the test sentence. In the following example the word AREA was
misrecognized as NEW-TRACK-NAME, which is a false alarm for a word from the
track name class.

SENTENCE (1025)

REF: WHEN   DID sherman last
HYP: WHEN+S THE sherman last

REF: downgrade for asuw
HYP: downgrade for asuw

REF: mission AREA
HYP: mission NEW-TRACK-NAME


classes        perp   cor   cls   sw   det   fal
shipname(+s)    100    42    36    5    83   1.7
shipname(+s)     60    49    30    5    84   1.1
portname        100    27    37    -    64   0.6
7 classes       100    44     6   24    74   3.4

Table 8.1: Detection of new-words results using the model M2.

For the word model M2, the first experiment was detecting new words from the classes
ship name and ship name possessive. The perplexity of the grammar was 100. We had a
detection rate of 83% and a false alarm rate of 1.7%. In the second experiment we changed
the perplexity of the grammar to 60 to measure the effect of the perplexity on the detection
rate and the false alarm rate. There was no significant difference in the detection rate
(84%) but the false alarm rate dropped to 1.1%, a reduction of 35% in the false alarm rate.
Our third experiment was detecting new words from the class port name with a grammar of
perplexity 100. We had a detection rate of 64% and a false alarm rate of 0.6%. In the
fourth experiment we tried to detect new words from 7 different classes with a grammar
of perplexity 100; the detection rate was 74% and the false alarm rate was 3.4%.
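
The arithmetic behind these figures can be checked with a few helper functions. The
sketch below is only a worked calculation: the function names are introduced here,
the percentages are those reported for M2, and the false alarm count of 3 out of 175
sentences is an illustrative inference from the reported 1.7% rate rather than a
number stated in the report.

```python
def total_detection(cor, cls, sw):
    """det is the sum of the exact, close-call, and class-switch rates."""
    return cor + cls + sw


def relative_reduction(before, after):
    """Fractional drop from one rate to another."""
    return (before - after) / before


def false_alarm_rate(false_alarms, n_sentences):
    """False alarms as a percentage of the number of test sentences."""
    return 100.0 * false_alarms / n_sentences


print(total_detection(49, 30, 5))                   # perplexity-60 row: det
print(round(100 * relative_reduction(1.7, 1.1)))    # drop in false alarm rate, %
print(round(false_alarm_rate(3, 175), 1))           # assumed count, see lead-in
```

The perplexity-60 row sums to the reported 84% detection rate, and the drop from 1.7%
to 1.1% corresponds to the quoted 35% relative reduction.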

In Table 8.2, we compare the detection results using the new-word models M2, M3 and
M4. The experiments were run on detecting the same set of removed words from 7 classes
with a grammar of perplexity 100, and similar bias against the new words.


       cor   cls   sw   det   fal
M2      44     6   24    74   3.4
M3      37     8   26    71   4.0
M4      21    14   41    76   8.6

Table 8.2: Detection of new words results for M2, M3 and M4 with 7 new-word classes
and grammar of perplexity 100.

The results for the model M2 are the same as the last row in Table 8.1. The results for
the model M3 are 71% detection rate and 4% false alarm rate. As for the model M4, the
results are 76% detection rate and 8.6% false alarm rate.

From Table 8.2 we can say that M2 is the best new-word model because it has the
lowest false alarm rate and a high detection rate. M3 results are very close to M2 results,
but M2 outperforms M3 in all categories. M4 has the highest detection rate but it also
has the highest false alarm rate. The high false alarm rate is due to the fact that the
diphone model M4 matches the existing words much better than the 3-state model M2.
That is, even though the diphone model is somewhat better for new words, it is a much
better model for existing words because it has been trained on the existing words. In
fact, it matches existing words better than their own models because it does not have the
sequential constraints that are in the models of the existing words. Therefore, it is hard to
tune the bias such that the diphone model M4 detects only new words. In addition, the
diphone model requires much more computation, since it consists of all possible diphones.


Figure  8.4:  Detection  Rate  vs.  False  Alarm  Rate  for  the  Speaker-Dependent  Part 


8.1.4  Speaker-dependent  vs  Speaker-Independent  Models 

To further test our new-word detection algorithm, we ran experiments on the speaker-
dependent and speaker-independent portions of the DARPA 1000-Word Resource
Management Corpus. In these experiments we tested the detection of new words from the
classes ship-name, ship-name-possessive, port-name, water-name, land-name, capability
and track-name. The results were obtained by training the system on 500 sentences which
do not include any token from the new words that are in the test sentences. A statistical
class grammar of perplexity 100 was used in obtaining these results. The detection rate
and the false alarm rate can be tuned by varying the bias against the new words in the
statistical class grammar. The bias is a scalar multiplied by the probability of a new word
in the class grammar. If we increase the bias against the new words, the false alarm rate
decreases with a minimal loss in the detection rate.

In Figure 8.4, we plot the detection rate versus the false alarm rate for the detection
of new words at various bias values against new words. The experiments were run on the
speaker-dependent part of the corpus. A good operating point is at the knee of the curve,
where 71% of the new words in the test sentences were detected and less than 1 sentence
out of 100 sentences had a false alarm. This is a very useful result which can be used in
various applications to detect new words in speech recognition.
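
The effect of the bias and the two rates plotted in Figure 8.4 can be sketched as follows. This is a minimal illustration, not the BYBLOS implementation: the function names are hypothetical, the bias is applied in the log domain for convenience, and the convention that a smaller multiplier means a stronger bias against new words is an assumption.

```python
import math


def biased_new_word_log_prob(p_new_word: float, bias: float) -> float:
    """Log probability of the new-word model after applying the bias.

    p_new_word: class-grammar probability of a new word in some class.
    bias: scalar multiplier; a smaller value is a stronger bias
          against new words (assumed convention).
    """
    if p_new_word <= 0.0 or bias <= 0.0:
        raise ValueError("p_new_word and bias must be positive")
    return math.log(bias) + math.log(p_new_word)


def operating_point(new_words_detected: int, new_words_total: int,
                    sentences_with_false_alarm: int, sentences_total: int):
    """Return (detection rate, false alarm rate) for one bias setting.

    The false alarm rate is counted per sentence, following the
    "1 sentence out of 100" wording in the text.
    """
    detection_rate = new_words_detected / new_words_total
    false_alarm_rate = sentences_with_false_alarm / sentences_total
    return detection_rate, false_alarm_rate
```

Sweeping the bias over a range of values and computing one operating point per setting from the recognizer output gives a curve of the kind shown in Figure 8.4.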


Figure 8.5: Detection Rate vs. False Alarm Rate for the Speaker-Independent Part

In Figure 8.5, we plot the detection rate versus the false alarm rate for the detection
of new words at various bias values for the speaker-independent part of the above corpus.
The system was trained using the 12-speaker speaker-independent training paradigm
(12-SI) [57] and a statistical class grammar of perplexity 23. The best operating point is a
detection rate of 50% and a false alarm rate of 2%.


BBN  Systems  and  Technologies 


BBN  Report  No.  7715 




Comparing Figures 8.4 and 8.5, we clearly see that this solution for the detection of
new words works better for the speaker-dependent scenario than the speaker-independent
scenario, especially when we consider that the perplexity of the grammar used in the
speaker-independent experiment was lower.


Conclusions


From the above results we have shown that the problem of detecting new words can be
solved by selecting an explicit word model for new words. We tried 4 models for new
words and compared their results. The 3-state model, consisting of two or more
context-independent phonemes, has a high detection rate of 74% and the lowest false alarm rate
of 3.4%. The 5-state model showed no advantage from increasing the minimum
length of new words to 4 phonemes. The 3-state model outperformed it in all aspects. The
diphone model had a high false alarm rate because it models the existing words very well.
Reducing the perplexity of the class grammar from 100 to 60 does not affect the detection
rate significantly but reduces the false alarm rate.

A  second  and  expected  conclusion  is  that  the  detection  of  new  words  is  easier  for 
speaker-dependent  models  than  for  speaker-independent  models.  Interestingly,  the  false 
alarm  rate  seems  to  be  closely  related  to  the  overall  word  error  rate. 


8.2  Adding  New  Words 


The new-word problem in speech recognition systems has two parts. The first part is to
detect that the user has spoken a new word. The second part is to add that new word to the
vocabulary of the system after it is detected. The two parts can be solved independently.
Some speech recognition systems have a capability of adding new words. However, to the
best of our knowledge, the BBN BYBLOS continuous speech recognition system is the
only system that has a capability of detecting new words.

In this section we present an interactive technique for modeling and adding new words
to the vocabulary of the system.


8.2.1 Modeling a New Word

Once  a  new  word  is  detected,  it  is  desirable  to  add  the  word  to  the  vocabulary  of  the 
system.  The  system  needs  a  phonetic  transcription  for  the  new  word  in  order  to  build  a 
Hidden  Markov  Model  (HMM)  for  the  new  word. 

It is assumed that the user knows no more than how to spell or pronounce the new
word. When the system asks the user to type the orthographic spelling of the new word,
the system first checks that the word is in fact a new word. Then, it looks the word up in a
large phonetic dictionary. If the system finds the phonetic transcription in the dictionary,
it uses it in building a model for the new word. Otherwise, the system has to obtain a
phonetic transcription for the new word using an alternate method.
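
The lookup order described above can be sketched as a small function. This is only an illustration under stated assumptions: the function name, the lower-casing of the spelling, and the phoneme-list representation are hypothetical, and `alternate_method` stands in for whatever fallback transcription method is used.

```python
from typing import Callable, Dict, List, Optional, Set

Phonemes = List[str]


def transcribe_new_word(
    spelling: str,
    vocabulary: Set[str],
    dictionary: Dict[str, Phonemes],
    alternate_method: Callable[[str], Phonemes],
) -> Optional[Phonemes]:
    """Return a phonetic transcription for a typed word, or None if the
    word is already in the system vocabulary (i.e. it is not new)."""
    word = spelling.strip().lower()
    if word in vocabulary:
        return None  # not a new word; nothing to add
    if word in dictionary:
        return dictionary[word]  # found in the large phonetic dictionary
    return alternate_method(word)  # fall back to another method
```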

In the literature, Lucassen et al. [60] and Bahl et al. [10] have used an information-theoretic
approach to find the phonetic transcription of a word given its orthographic spelling.
Although the method has high accuracy, it needs a very large training dictionary and is
very compute-intensive.

Below  we  present  several  possibilities  for  finding  the  phonetic  transcription  of  a  new 
word.  First,  we  present  the  phonetic  recognition  capability  of  the  BYBLOS  system.  Then, 
we present the phonetic transcription capability of DECtalk. Finally, we present our
approach that combines the above two methods to generate phonetic transcriptions that are
sufficient for recognition purposes.


Phonetic  Recognition 


The  first  approach  was  to  run  the  speech  recognition  system  in  a  phonetic  recognition 
mode. The goal was to see if the phonetic recognition accuracy is high enough to be used
to  transcribe  new  words. 


Unit        Grammar   Correct phonemes (%)   Error rate (%)
Phonemes    Null      59.8                   44.0
Phonemes    CG        69.4                   34.9
Diphones    CG        78.8                   24.2
Triphones   CG        84.4                   18.0

Table 8.3: Phonetic Recognition Performance of the BYBLOS System.


Table 8.3 shows the phonetic recognition accuracy of the speech recognition system.
The table shows the accuracy for context-independent and context-dependent phoneme
models. The context-independent models were tested with a full-branching grammar (null),
where each phoneme can be followed by any other phoneme with equal probability, and a
statistical class grammar (CG), computed from the phonetic transcriptions of 600 training
sentences. Each phoneme is placed in a separate class. As expected, Table 8.3 shows that
using a class grammar improves the phonetic recognition accuracy. Also, the table shows
that using context-dependent models (diphones and triphones) improves the accuracy of
the system.
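
The difference between the two grammars can be sketched as follows. This is an illustration only: the report does not state the order of the statistical grammar or how it was smoothed, so the bigram form and the add-one smoothing below are assumptions, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def null_grammar(phonemes: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Full-branching grammar: every phoneme can follow every phoneme
    with equal probability."""
    p = 1.0 / len(phonemes)
    return {prev: {nxt: p for nxt in phonemes} for prev in phonemes}


def bigram_grammar(transcriptions: List[List[str]],
                   phonemes: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next | previous) from phonetic transcriptions of
    training sentences, with add-one smoothing (an assumption)."""
    counts = {prev: Counter() for prev in phonemes}
    for transcription in transcriptions:
        for prev, nxt in zip(transcription, transcription[1:]):
            counts[prev][nxt] += 1
    grammar = {}
    for prev in phonemes:
        total = sum(counts[prev].values()) + len(phonemes)
        grammar[prev] = {nxt: (counts[prev][nxt] + 1) / total
                         for nxt in phonemes}
    return grammar
```

A statistical grammar concentrates probability on phoneme sequences seen in training, which is one way to read the improvement from the Null row to the CG row of Table 8.3.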

The phonetic recognition accuracy shown in Table 8.3 is not high enough to be used for
transcribing new words, as will be seen below. Hence another method should be used.


DECtalk 

The next approach was to use the phonetic transcription capability of DECtalk. We ran
several experiments which tested the suitability of DECtalk transcriptions for speech
recognition purposes. Table 8.4 shows DECtalk’s phonetic transcription accuracy for 1000 words
from the Resource Management Corpus. Then, DECtalk transcriptions were used in word
recognition experiments.


Phonetic transcription source   Correct phonemes (%)   Error rate (%)
DECtalk                         88.4                   12.5

Table 8.4: Phonetic Recognition Results for RM 1000 Words Using DECtalk’s text-to-sound
rules.


Phonetic Transcription Source   Word Error Rate (%)
All words hand transcribed      4.4
All words from DECtalk          21.2
41 words from DECtalk           6.2

Table 8.5: Word Error Rates in Speech Recognition Using Several Phonetic Transcription
Methods.

Table 8.5 shows the error rates on the 1000-Word Resource Management Corpus. The
tests were run on 7 speakers from the May 88 test. The table shows the performance of the
system when all the words were hand transcribed, when all the words were transcribed by
DECtalk, and finally when 41 words, designated as new words, were transcribed by DECtalk
and the rest hand transcribed.

There were 62 instances of the 41 words in the test sentences. The system recognized
them all when the words were hand transcribed. When just these words were transcribed
by DECtalk, the system recognized 48 of the instances and missed 14 (22.6%). The errors
were in 9 out of the 41 words. 5 words were misrecognized all the time. This clearly shows
that relying on DECtalk transcriptions is not sufficient for recognizing the new words.


Probabilistic Transformation of DECtalk Transcriptions

To get a more reliable phonetic transcription for a new word, we probabilistically combine
two knowledge sources to obtain an improved phonetic transcription. The system prompts
the user to type the orthographic spelling of the new word. The system passes the spelling
to DECtalk, which in turn produces an initial (possibly errorful) phonetic transcription for
the new word. Then we use a probabilistic transformation method to create a pronunciation
network. Then the system prompts the user to pronounce the word and runs in a phonetic
recognition mode. The pronunciation network is used to constrain the phonetic recognition
process. Figure 8.6 summarizes this phonetic transcription method.


Figure 8.6: Obtaining a Phonetic Transcription for the New Word Given its Orthographic
Spelling.

The probabilistic transformation is performed as follows. Starting with a set of correct
transcriptions of a large number of words, we find the corresponding phonetic spellings
given by DECtalk and we compute a confusion matrix between the two sets of phonemes.
The confusion matrix is then normalized, resulting in the probability that DECtalk will
confuse each phoneme for each of the other phonemes. Given DECtalk’s phonetic
transcription for a new word, we then form a finite-state network that contains all the possible
pronunciations of the new word, given the confusion probabilities. Using this network to
constrain the possible phonetic sequences for the new word, we perform phonetic
recognition of the single token of the pronunciation of the new word provided by the user. The
phonetic sequence that results in the highest score is then our final, corrected phonetic
transcription of the new word.
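
The corrective procedure can be sketched in a simplified form. The code below is an illustration under several assumptions, not the method as implemented: it takes phoneme pairs that are already aligned one-to-one (so insertions and deletions are ignored), it normalizes the confusion counts per DECtalk phoneme, it prunes unlikely alternatives with a hypothetical `floor` parameter, and it replaces HMM phonetic recognition over the network with independent per-position acoustic log scores supplied by the caller. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def confusion_probabilities(aligned_pairs):
    """aligned_pairs: iterable of (dectalk_phoneme, correct_phoneme).

    Returns P(correct | dectalk) as nested dicts (row-normalized counts).
    """
    counts = defaultdict(Counter)
    for dectalk_ph, correct_ph in aligned_pairs:
        counts[dectalk_ph][correct_ph] += 1
    return {
        d: {c: n / sum(row.values()) for c, n in row.items()}
        for d, row in counts.items()
    }


def pronunciation_network(dectalk_transcription, confusions, floor=0.05):
    """One slot per DECtalk phoneme; each slot maps candidate phonemes
    to log probabilities. Candidates below `floor` are pruned."""
    network = []
    for ph in dectalk_transcription:
        row = confusions.get(ph, {ph: 1.0})
        slot = {c: math.log(p) for c, p in row.items() if p >= floor}
        if not slot:  # everything was pruned; keep the DECtalk phoneme
            slot = {ph: 0.0}
        network.append(slot)
    return network


def best_transcription(network, acoustic_log_scores):
    """Pick the highest-scoring candidate in each slot.

    acoustic_log_scores[i][ph] stands in for the recognizer's score of
    phoneme ph at position i of the user's spoken token.
    """
    result = []
    for slot, acoustic in zip(network, acoustic_log_scores):
        best = max(slot,
                   key=lambda ph: slot[ph] + acoustic.get(ph, -math.inf))
        result.append(best)
    return result
```

Because only the alternatives licensed by the confusion probabilities are considered, the acoustic evidence from a single spoken token has far fewer competing sequences to choose among than in unconstrained phonetic recognition.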


Note that using BYBLOS to perform phonetic recognition on the single token does not
result in sufficiently high accuracy for our purposes. The procedure described above uses a
very tight pronunciation network, employing another source of information (DECtalk), to
constrain the search and, hence, improve the phonetic recognition accuracy. The phonetic
transcription will be used in building a specific model for the new word.

The total instances of new words that were not recognized by the system after their
transcriptions were improved dropped from 14 (22.6%) to 2 (3%). Table 8.6 shows the
word error rate when we used improved transcriptions for the new words by the above
corrective procedure. When compared with the results in Table 8.5, it clearly shows that
the approach is viable and can be used to transcribe new words.


Phonetic Transcription Source         Word Error Rate (%)
41 words after corrective procedure   4.5

Table 8.6: Word Error Rates in Recognition After the Corrective Procedure.


8.2.2  Adding  a  New  Word  to  the  System 

When a new word is detected and a proper word model has been built for it, the word
is added to the dictionary of the system and the word model is added to the word-model
database. The new word will be part of the system the next time it is used.

Adding a new word does not end at building a model for the new word. The system
has to know how this new word fits in the system language model. The system adds the
new word to its statistical class grammar by finding out from the user how the new word fits
in the language model. To do so, it prompts the user with a list of classes that allow new
words and a sample word from each class. The user picks one or more words (or classes) that
the new word should belong to.

It is assumed that all words within a class, in the statistical class grammar, are equally
probable. When adding a new word to one of the classes, the probabilities of the words
in that class are reset to 1/(N + 1), where N is the number of words in the class before the
addition. Other techniques for adding new words to a language model are suggested by [48] and [56].
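
The probability reset for a class can be written out directly. This is a minimal sketch; the function name and the list representation of a class are hypothetical, and N is taken to be the class size before the new word is added.

```python
from typing import Dict, List


def add_word_to_class(class_members: List[str], new_word: str) -> Dict[str, float]:
    """Return the within-class word probabilities after adding new_word.

    All words in a class are assumed equally probable, so every member,
    including the new one, receives 1 / (N + 1).
    """
    if new_word in class_members:
        raise ValueError("word is already in the class")
    n = len(class_members)
    probability = 1.0 / (n + 1)
    return {word: probability for word in class_members + [new_word]}
```

With this rule the within-class probabilities still sum to one after the addition, so the class-level probabilities of the grammar do not need to change.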


8.2.3 Conclusions


In this work we have reported on the detection of new words for the speaker-dependent and
speaker-independent paradigms. A useful operating point in a speaker-dependent paradigm
was defined at 71% detection rate and 1% false alarm rate. For speaker-independent models,
we detected about 50% of the new words with a 2% false alarm rate. We have shown that
it is possible for a naive user to add new words simply by typing the word and saying it
once. The system can determine a reasonable phonetic pronunciation for the new word by
using a combination of the predicted spellings and the spoken example of the word.


Chapter  9 


Direct  Cooperation  with  Other  Research 
Sites 


One goal of the project is to enable efficient cooperative research to take place at multiple
sites. Besides the obvious benefits from sharing our ideas, one of the most profitable modes
of cooperation has been in enabling other research sites to use our HMM recognition system
as a first step in their research. Specifically, we have cooperated with the research effort at
Boston University in using Stochastic Segment Models, the effort at BBN in using Neural
Networks, and the effort at Unisys in building the understanding module of a Spoken
Language System (SLS).


9.1  Cooperation  with  Boston  University 


In an effort to combine the strengths of different speech recognition systems to improve
performance, we developed a simple method for combining systems, based on the N-best
paradigm developed at BBN. In particular, we are interested in combining the HMM-based
system at BBN with another at Boston University which uses stochastic segment
models (SSM). The SSM considers each phonetic segment as a single entity, and thus
must explicitly consider many different phonetic segmentations to determine the most likely
answer. As a result, it requires significantly more computation than the HMM. However,
preliminary experimental results indicate that the SSM might be more powerful than the
HMM. Even if it turns out not to be more powerful, the information represented in the
SSM is different to some extent from the HMM, and therefore could complement the HMM
nicely.

To this end, we use a variation of the N-best paradigm, as follows. Given a speech
utterance, the HMM system is used to create the N most likely alternative sentence
hypotheses. For each of these hypotheses, we also determine the corresponding HMM phonetic
segmentation. The phoneme sequence and its segmentation for each hypothesis are then
rescored by the SSM system. The scores of the two systems are then combined, together
with the grammar score, to determine the final answer. We developed a program that
determines the linear combination of (log) scores which maximizes recognition accuracy.
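
The score combination step can be sketched as follows. This is an illustration under stated assumptions rather than the program referred to above: the report does not say how the weights were optimized, so the exhaustive grid search is an assumption, sentence-level accuracy is used as a simple stand-in for recognition accuracy, and all names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence, Tuple


@dataclass
class Hypothesis:
    words: str
    hmm: float      # HMM log score
    ssm: float      # SSM log score from rescoring
    grammar: float  # grammar log score


def combined_score(h: Hypothesis, w: Tuple[float, float, float]) -> float:
    """Linear combination of the three log scores."""
    return w[0] * h.hmm + w[1] * h.ssm + w[2] * h.grammar


def best_hypothesis(nbest: List[Hypothesis],
                    w: Tuple[float, float, float]) -> Hypothesis:
    return max(nbest, key=lambda h: combined_score(h, w))


def tune_weights(dev_set: List[Tuple[List[Hypothesis], str]],
                 grid: Sequence[float]):
    """Grid-search the weights that maximize sentence accuracy on a
    development set of (N-best list, reference transcription) pairs."""
    best_w, best_accuracy = None, -1.0
    for w in product(grid, repeat=3):
        correct = sum(best_hypothesis(nbest, w).words == reference
                      for nbest, reference in dev_set)
        accuracy = correct / len(dev_set)
        if accuracy > best_accuracy:
            best_w, best_accuracy = w, accuracy
    return best_w, best_accuracy
```

Only the N hypotheses produced by the first system are ever rescored, which is where the computational saving for the second system comes from.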

One advantage of this method for combining different systems is that it is possible
to integrate two radically different systems with a minimum of effort. Another important
advantage  is  that  the  computation  required  for  the  second  system  is  reduced  tremendously, 
since  it  only  needs  to  rescore  those  hypotheses  that  were  judged  as  plausible  by  the  first 
system.  In  fact,  it  is  this  second  advantage  that  allows  the  SSM  system  to  reduce  its  com¬ 
putational  load  tremendously  and,  thus,  makes  it  into  a  computationally  feasible  approach. 

The  overall  results  of  this  cooperation  were  quite  satisfying.  First,  the  N-Best  approach 
made  it  quite  easy  for  the  cooperation  to  take  place,  since  we  were  able  to  send  lists 
of  hypotheses  to  BU  and  they  were  able  to  use  them  easily.  Second,  the  computational 
requirements  for  the  stochastic  segment  model  were  drastically  reduced.  Finally,  the  com¬ 
bination  of  the  two  recognition  technologies  was  shown  to  result  in  a  small  but  statistically 
significant  improvement  in  recognition  accuracy. 


9.2  Combined  HMMs  and  Neural  Networks 


Our research on the use of Neural Network techniques for speech recognition had two
of the same problems that plagued the effort in Stochastic Segment Models. First, the
computation required for training and recognition was prohibitive. Second, the accuracy
of the new model by itself was not as good as the HMM alone. We applied the same
approach (that of using the HMM as a preprocess to find the N-Best hypotheses) to the
Neural Networks problem. One result was that the computational problems were largely
avoided. The more important result was that we were able to demonstrate a significant
advance in the state of the art in speech recognition by the appropriate combination of
Neural Networks and HMMs [6].


9.3 Cooperation with Unisys


Since the research group at Unisys does not do speech recognition research of their own,
they had the problem of how to apply their natural language research to the problem of
spoken language understanding. They had initially considered the tightly coupled architecture
formulated at MIT Lincoln Laboratory. However, this proved to be quite complicated and
constraining. The N-Best Paradigm proved to be quite simple and effective. Specifically,
at each of the evaluations of the ATIS systems, we have sent text of the N-Best sentence
hypotheses output by the BYBLOS recognition system directly to Unisys (later Paramax).
(These were the very same lists that were sent to the DELPHI system at BBN.) Unisys
reported that it took only hours to configure their system to use this input, and that they
found that lists of 10-20 sentences were quite sufficient for the purpose of understanding.


Chapter  10 

Application  in  a  Military  Domain 


In this chapter, we describe our work on DART, our demonstration military application,
during the course of this project, and our creation of a videotape demonstrating the
applications of spoken language system technology.


10.1  The  DART  Demonstration  Application 


The  DARPA  SLS  Program  is  developing  a  technology  that  has  been  justified,  at  least  in 
part,  by  its  potential  relevance  to  military  applications.  In  an  effort  to  demonstrate  the 
relevance of SLS technology to real-world military applications, BBN undertook the task
of  providing  a  spoken  language  interface  to  DART  (Dynamic  Analytical  Replanning  Tool), 
a  system  for  military  logistical  transportation  planning. 

We discuss the transportation planning process, describe the real-world DART system,
identify parts of the system where spoken language can facilitate planning, and describe
BBN’s work of porting the HARC SLS system to the DART domain.


10.1.1 Transportation Planning


Logistical transportation planning is the process of determining how to get people and
cargo from where they are to where they need to be. Inter-theatre movements of personnel
and supplies around the world are currently planned for the Army, Navy, Air Force, and
other services by USTRANSCOM (the US TRANSportation COMmand), which operates
under the Joint Chiefs of Staff.

We chose TRANSCOM as an SLS application domain because it presented a number
of advantages:


1. The application involves an essential military function and successful application of
spoken language technology would be very desirable to DARPA’s clients.

2. The concept of planning movements of people and supplies is understandable in both
military and non-military contexts.

3. The application is non-trivial, and affords many opportunities for applying spoken
language understanding. A series of demonstrations is possible, with various features
at varying levels of effort.

4. Current efforts to improve the planning process using non-speech technology have
been well-received, and cooperative users may be available as close as Scott Air
Force Base near St. Louis.

5.  An  unclassified  development  database  is  available  in  Oracle  on  a  Sun. 


10.1.2 The DART System

BBN’s DART (Dynamic Analytical Replanning Tool) project, sponsored by DARPA and
RADC, demonstrated the operational impact of AI planning and scheduling technology on
transportation planning at USTRANSCOM. DART addresses an urgent need for fast and
accurate plan generation and evaluation to support both long-range, hypothetical planning
and planning in such crisis-response operations as those in the Middle East.

The current DART system [40] is in use at Scott Air Force Base and other locations
around the globe. The workstation environment which has been installed at TRANSCOM
to support DART is already being used and has been credited with reducing routine plan
analysis from 3 days to 1 day.

The architecture of the DART system is shown in Figure 10.1. The heart of the system
is a relational database. The database is initialized with data from two sources, a database
of transportation characteristics, and a Time Phased Force Deployment Database (TPFDD).
TPFDDs are usually prepared in advance to deal with hypothetical military operations. In
a crisis situation, the planner’s task is usually to retrieve an applicable TPFDD, and to
change it to fit that new situation. The output of the process is a modified TPFDD which
can be used in subsequent planning and operational activities.

Figure 10.1: Architecture of DART

DART  makes  a  number  of  tools  available  to  the  planner.  These  include  a  TPFDD  editor 
for  viewing  units  and  making  changes  in  their  characteristics  and  transportation  plans,  a 
notional  ports  editor  which  allows  ports  to  be  combined  for  purposes  of  planning  and 
simulation,  a  transportation  assets  editor  which  lets  the  planner  modify  the  availability  and 
characteristics  of  various  transportation  assets,  the  RAPIDSIM  simulation  system  which 
can “run the current plan”, and an analysis capability that enables the planner to examine
the output of a RAPIDSIM run to determine whether or not the objectives were achieved.

DART allows a planner to extract pieces of pre-planned movement records from a
database by specifying simple constraints on up to five items: the units to be moved (a unit
generally contains both personnel and cargo), the place of origin of the units, their port of
embarkation, their port of debarkation, and their final destination.

The retrieved data is displayed in a spreadsheet-like window, with horizontal bars showing
the number of days each step of the transport is expected to take, and the color indicating
whether the step is by land, sea, etc. An example of this window, and other parts of the
normal DART display, is given in Figure 10.2.


10.1.3 The TPFDD Database


The database that underlies the entire planning process is called the TPFDD (Time Phased
Force Deployment Database) [51]. The TPFDD development database that is unclassified
and available in Oracle has 50-100 MB of data, in 13 tables and about 500 fields. This data
represents approximately 20,000 cargo movement records, 9,000 unit movement records,
and a smaller number of personnel movement records. Each record contains, among other
information:


•  location  of  origin 

•  POE  (port  of  embarkation) 

•  intermediate  locations,  if  any 

•  transportation  mode  (land,  sea,  air) 

•  transportation  provider 

•  POD  (port  of  debarkation/discharge) 

•  location  of  destination 

•  RLD  (ready  to  load  date)  at  origin 

•  ALD  (available  to  load  date)  at  POE

•  EAD  (earliest  arrival  date)  at  POD

•  LAD  (latest  arrival  date)  at  POD

•  RDD  (required  delivery  date)  at  destination 
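
A record with these fields, and the kind of five-item constraint query that DART supports, can be sketched as below. This is an illustration only and does not reflect the actual TPFDD schema: the class and field names are hypothetical, dates are represented as integer day numbers by assumption, intermediate locations and the transportation provider are omitted for brevity, and constraints are limited to exact matches.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MovementRecord:
    unit: str
    origin: str
    poe: str          # port of embarkation
    pod: str          # port of debarkation
    destination: str
    mode: str         # "land", "sea", or "air"
    rld: int          # ready to load date at origin (day number)
    ald: int          # available to load date at POE
    ead: int          # earliest arrival date at POD
    lad: int          # latest arrival date at POD
    rdd: int          # required delivery date at destination


def select_records(records: List[MovementRecord],
                   unit: Optional[str] = None,
                   origin: Optional[str] = None,
                   poe: Optional[str] = None,
                   pod: Optional[str] = None,
                   destination: Optional[str] = None) -> List[MovementRecord]:
    """Return the records matching every constraint that was supplied."""
    constraints = {"unit": unit, "origin": origin, "poe": poe,
                   "pod": pod, "destination": destination}
    active = {k: v for k, v in constraints.items() if v is not None}
    return [r for r in records
            if all(getattr(r, k) == v for k, v in active.items())]
```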

10.1.4  DART  plus  SLS 

Natural  language  access  (both  spoken  and  typed)  increases  the  utility  of  the  DART  interface 
by  providing  capabilities  that  are  not  available  in  the  non-language  interface,  and  it  can 
decrease the task completion time for operations that can be expressed more concisely in
words  than  in  mouse  actions. 

We identified six areas of the DART system where natural language will provide
increased functionality for this military system:

1. the TPFDD editor, which allows users to create and modify entries in the Time
Phased Force Deployment Databases that specify movement requirements for the
personnel and materiel involved in planned military operations,

2. the transportation assets editor, which is used to view and change the number and
type of transportation assets (ships, planes, etc.) and the days when they are available,

3. the notional ports editor, which is used to combine actual ports (sea ports, air ports,
or other geographical locations) into single “notional” ports to simplify subsequent
simulations of planned movements,

4. the analysis of results from the RAPIDSIM simulation of the current plan’s execution,

5. universal (that is, available throughout the whole DART system) access to information
in the TPFDD database that underlies the planning system,

6. menu navigation through the DART system, so that a user can use a single verbal
command instead of a lengthy sequence of mouse (and possibly keyboard) operations.


Each  of  these  opportunities  for  adding  spoken  language  to  the  DART  interface  has 
separate  pros  and  cons.  They  vary  in  expected  vocabulary  size,  likely  language  complexity, 
ease  of  interface  to  DART,  and  utility  for  the  user. 

For example, in the notional ports editor, the user is likely to want to give short
commands to the system (“Show me Travis Air Force Base”, “Zoom in around Charleston”,
“What’s this port?”, “Show the nearest military airport”, “Compute the notional port
assignments”). The planner is also likely to refer only to the geographical locations that are
displayed on the current map, which reduces the vocabulary (and the perplexity)
considerably.

Universal database query, on the other hand, will involve complex language (“What
percentage of the Navy units headed for air force bases in Tunisia that are available to load
from US ports prior to day 20 contain hazardous cargo?”). This part of the application
will also require a very large vocabulary, since virtually any geographic location or other
word from the database can be used in a query. We estimate that even for just a good
demonstration, the vocabulary will need to be about 5000 words.


10.1.5  Current  Status 


By the end of this project, we transferred from the small in-core planning database that we
developed for demonstrating HARC to using the real TRANSCOM development database in
Oracle. We developed an initial interface between the DART user interface and the windows
indicating activities in speech processing and NL understanding. We have implemented a
mechanism to allow units that are retrieved via a natural language query to be imported
into the DART plan display.


10.2 Videotape


To demonstrate our various capabilities in real-time spoken language systems, we developed
a videotape that included demonstrations of:


• The real-time N-best BYBLOS system working in speaker-independent mode in the
resource management domain.

• Speaker adaptation to improve recognition performance for non-native speakers of
American English.

• The real-time HARC spoken language system in the military logistical planning
domain, showing the combination of speech recognition and natural language
understanding.

• The DART application and the DART+SLS demonstration of the use of spoken
language to enhance system capabilities in one of the six areas described in the
previous section.


The DART+SLS demo describes the task of TRANSCOM planners and shows examples
of the interactions that are possible with spoken language to facilitate the planner’s work.
For example, because the DART+SLS system has independent access to the system’s
database, the user can ask questions about the data which would have been difficult or
impossible to answer in the existing DART system. Also, certain queries that currently
require a large number of mouse clicks can be asked with a single spoken utterance.

This videotape was presented at the February 1991 DARPA Speech and Natural Language
Workshop.

Figure 10.2: The DART User Interface


Chapter  11 


Common  Data  Collection  and 
Evaluation 


11.1  Development  of  a  Performance  Evaluation  Methodology 


While the Continuous Speech Recognition (CSR) community has had an established
methodology for performance evaluation on a common corpus for several years, there was no
similar methodology for the evaluation of natural language systems or of spoken language
systems at the beginning of this contract. BBN played an important role in the development
of a common evaluation methodology. BBN detailed the basic methodology used for NL
and SLS evaluation in [26]. This work specified the following aspects of evaluation, which
are still in use today:


• Evaluation utilizes input-output pairs.

• Input is a spoken or written utterance.

• Output is the answer retrieved by the input from a common database.

• The correct answer is retrieved from the common database.

• Both the reference answer and the hypothesis answer to be evaluated are presented
in a common answer specification (CAS) form.

• An automatic “comparator” is used to compare the answers and score the system
output, allowing for some unimportant differences in presentation of the data (such
as order of elements) to be disregarded.

BBN’s proposal [26] also detailed the requirements for an automatic software
comparator which would match a hypothesis CAS produced by an NL or SLS system under
evaluation with a reference CAS. (Among such requirements is the necessity for an epsilon
factor for numeric comparison of floating point numbers.) These proposals were finally
adopted by the DARPA community with little modification. BBN also produced the first
comparator program in LISP, which was made available to the SLS community. The
C-based comparator produced by NIST eventually recapitulated the functionality of this
system. For sites which do not possess Oracle databases, BBN also documented its
Common LISP-based data base and its retrieval language, ERL [77], and sent the documentation
and a no-cost license agreement for the data base interface to CMU, Dragon Systems, MIT,
NIST, SRI International, TI, and Unisys, thereby making the database software available
to the general DARPA SLS community.
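
The two comparator requirements named above, disregarding the order of elements and
using an epsilon factor for floating point numbers, can be illustrated with a small Python
sketch. This is not the BBN LISP comparator or the NIST C comparator; the representation
of an answer as a list of row tuples, the function names, and the tolerance value are all
assumptions made for illustration.

```python
import math

# Assumed tolerance; the report does not state the epsilon that was used.
EPSILON = 1e-4


def values_match(ref, hyp, eps=EPSILON):
    """Compare two scalar values; numbers within the tolerance count as equal."""
    if isinstance(ref, bool) or isinstance(hyp, bool):
        return ref == hyp
    if isinstance(ref, (int, float)) and isinstance(hyp, (int, float)):
        return math.isclose(ref, hyp, rel_tol=eps, abs_tol=eps)
    return ref == hyp


def rows_match(ref_row, hyp_row, eps=EPSILON):
    """Two rows match if they have the same length and matching values."""
    return len(ref_row) == len(hyp_row) and all(
        values_match(r, h, eps) for r, h in zip(ref_row, hyp_row)
    )


def answers_match(reference, hypothesis, eps=EPSILON):
    """True if both answers contain the same rows, ignoring row order."""
    if len(reference) != len(hypothesis):
        return False
    unmatched = list(hypothesis)
    for ref_row in reference:
        for i, hyp_row in enumerate(unmatched):
            if rows_match(ref_row, hyp_row, eps):
                del unmatched[i]  # each hypothesis row may be used only once
                break
        else:
            return False  # no hypothesis row matches this reference row
    return True
```

For example, `answers_match([("BOS", 250.0), ("DEN", 99.5)], [("DEN", 99.50001), ("BOS", 250.0)])`
is true under these assumptions, because the rows differ only in order and in a numeric
difference smaller than the tolerance. The greedy row pairing is adequate for a sketch; a
production comparator would need a more careful treatment of near-equal rows.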

We also implemented an early Wizard of Oz System [26], documented it, and delivered
it to Texas Instruments so that they could gather realistic speech input to a natural language
interface to a database. We trained several TI employees to serve as Wizards.

We have continued our strong participation in developing a methodology for common
evaluation of spoken language systems, especially the evaluation of natural language
understanding systems [15], [12], [14]. For example, we made our database expertise available to
TI and helped them in their effort to produce a relational ATIS database from the original
ATIS data obtained from OAG. We also helped in specifying various aspects of the Wizard
data collection scenario and the performance evaluation process with NIST, including
specification of general templates for descriptions of the CSR and NL systems submitted
for evaluation. Examples of these templates, describing DELPHI, BYBLOS, and HARC,
appear in Section 11.6.


11.2  ATIS  Data  Collection 


During the summer of 1991, we developed a Wizard system for collecting data in the ATIS
domain, collected data in the formats agreed upon by the MADCOW committee, and sent
the results to NIST. We first obtained the new database (rdb3-beta), a modified version
of the original ATIS database, and installed this on our machines. Since this database
contained new tables, fields, and field values, we next modified the interface between
DELPHI and the database accordingly. We then integrated the modified DELPHI system
into our Wizard set-up for collecting data. As part of our data collection effort, we defined
24 scenarios, some of which we obtained from MIT, TI, and SRI. These scenarios are
presented in Section 11.3 at the end of this chapter. We also collaborated with SRI in
defining a common scenario for cross-site data collection.

For our Wizard data collection, we produced a “Wizard” data collection setup to elicit
speech from subjects by presenting them with a computer system that appeared to
understand them (that is, the computer presented responses to what the subject said), but which
actually had a human “wizard” listening to the spoken questions and typing the commands
to the system to produce the answer.

This data collection setup employed a novel, interactive subject and wizard interface
based on X-windows. The subject’s queries and answers were stacked on the color screen
for later examination or other manipulation by the subject. The system also used our
real-time BYBLOS speech recognition system as the front end. Real-time speech recognition
was used primarily to discourage subjects from speaking too sloppily (the wizard had the
choice of using the speech recognition output or correcting it).

The scenarios that were used to elicit speech in the data collection setup included both
trip planning scenarios and problem solving scenarios that involved more general kinds of
database access. We believe that this data provided a richer range of training language than
trip planning alone, and thus will help to distinguish systems that are unnecessarily narrow
in focus from those with more general capabilities. The Wizard set-up used the following
protocol:


• The Wizard and Subject are seated at separate Sun workstations, each equipped with
an X-windows display and a mouse.

• The Wizard tells the Subject about the data collection, explains the display and how
to manipulate it, gives the user written instructions and a list of scenarios, and goes
through a practice scenario to familiarize the subject with the process.

•  Using  the  mouse  and  a  “control-panel”  icon,  the  Subject  specifies  the  scenario  in 
effect  for  the  current  session. 

•  During  the  course  of  the  session,  the  Subject  precedes  each  utterance  by  clicking  on 
a  software  push-to-talk  button  with  the  mouse,  to  activate  the  speech  system. 

•  The  Subject  speaks  into  a  close-talking  microphone,  which  sends  the  speech  to  our 
speech  recognition  system,  BYBLOS.  The  speech  is  digitized  and  stored  in  a  .wav 
file. 

• BYBLOS produces its N-best hypotheses about the input utterance, which are
displayed to the Wizard.

• The Wizard chooses one of these, possibly editing it so that it is a reasonable quick
transcription of what the speaker said, and displays it to the Subject.

• DELPHI processes the query.

• When the system produces an answer, this is displayed to the Subject. If DELPHI
fails to process the utterance, the Wizard either displays an error message to the
Subject or submits a modified query to produce an answer.

• This continues until the end of the scenario, which is signalled by the Subject clicking
on an “End Scenario” button with the mouse.
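
The per-utterance part of this protocol (N-best list, Wizard selection, DELPHI processing,
and the Wizard’s fallback) can be summarized in a short Python sketch. The function
name and the callback parameters below are hypothetical stand-ins for the Wizard’s actions
and for the recognition and understanding components; they do not reflect the actual
interfaces of BYBLOS or DELPHI.

```python
from typing import Callable, List, Optional


def run_wizard_turn(
    nbest: List[str],
    wizard_choose: Callable[[List[str]], str],
    understand: Callable[[str], Optional[str]],
    wizard_recover: Callable[[str], Optional[str]],
) -> dict:
    """Process one utterance following the protocol steps listed above."""
    # The Wizard picks (and possibly edits) one of the N-best hypotheses.
    transcription = wizard_choose(nbest)

    # The NL component tries to answer; None stands for a processing failure.
    answer = understand(transcription)

    if answer is None:
        # The Wizard either submits a modified query or gives up (returns None).
        modified = wizard_recover(transcription)
        if modified is not None:
            answer = understand(modified)

    return {
        "shown_to_subject": transcription,
        "answer": answer if answer is not None else "ERROR: unable to answer",
    }
```

In this sketch the Subject always sees the transcription chosen by the Wizard, while any
modified query is used only to obtain an answer.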


Each session produced a separate logfile, in an internal format. Each logfile was
processed by a LISP program to transform it into a form consistent with the MADCOW
specifications for common logfiles. We also manually produced the .sro transcription
files by listening to each utterance. We obtained a program from NIST for checking
conformance of our digitized speech (.wav) files with NIST’s standard, to ensure their
correctness before shipping them to NIST.

We used this data collection system to collect data from 62 subjects, 6 of whom came
back for a second session. We collected over 2200 sentences in all and delivered them
to NIST, fulfilling the cross-site agreement to collect this amount of data by the end of
August. BBN, in fact, was the only site to collect the targeted number of acceptable queries
by the original deadline of 1 September 1991. We present a breakdown of the data by
gender and by scenarios in Table 11.1. In addition, we submitted to NIST, as required,
the text of all scenarios as well as the instructions given to subjects. Our scenarios are
presented in Section 11.3 and the subject instructions in Section 11.4. We also include an
actual sample session in Section 11.5.


11.3 BBN ATIS Scenarios


This section contains all the scenarios used for data collection by BBN during the summer
of 1991, grouped by scenario type.


11.3.1 Initial Scenario


In order to accustom the user to the push-to-talk button and adjust volume levels, the first
query each subject asked was the following:


P1: What is the name of the airport in Boston?


Speakers

           Speakers   Utterances   Utterances/speaker
Male          34         1209            35.6
Female        28         1068            38.1
Total         62         2277            36.7

Average: 4.95 scenarios/speaker


Scenarios

A = flight planning scenarios (includes the common scenario used by all sites)
B = problem solving scenarios
C = short planning scenarios (mostly the same as some of MIT’s scenarios)

           #scenarios   #utterances   #utterances/scenario
A             108           994              9.20
B              86           705              8.20
C             113           578              5.12
Total         307          2277              7.51

Common Scenario: 33 (10.75% of total scenarios)

Table 11.1: Breakdown of BBN Wizard Data by Gender and Scenario


11.3.2 Practice Scenarios

These scenarios could be answered in one or two questions and were designed to acquaint
the user with the type of information in the database and with interacting with the system.

Cl:  Find  the  cheapest  (or  the  most  expensive)  one-way  fare  from  one  city  to  another. 

C2: Find the earliest (or latest) flight from one city to another that serves a meal of your
choice.

C3: Determine the type of aircraft used on a flight from one city to another that leaves
before (or after) a certain time of the day.

C4: Find a transcontinental (east coast to west coast) flight on your favorite airline from
one city to another that makes a stop-over in a city of your choice.

C5: Determine the longest day trip you can make between two cities of your choice. (You
are trying to maximize the amount of time on the ground at your destination.)

C6: Find a flight between two cities. The flight should leave in the afternoon and arrive
around your dinner time. It should be a non-stop flight.


11.3.3 Trip Planning Scenarios

These are scenarios that required longer and more complicated interactions with the system.
Each scenario involves planning a trip.

A1: Plan the travel arrangements for a small family reunion. The reunion will be held in
Baltimore. Three family members will be attending the meeting:

One of them, who typifies the “high class” life style, will be coming from Denver; find
first-class travel arrangements for her on United Airlines.

The second has a life style described as “economy”; find travel arrangements from
Dallas to Baltimore for the least amount of money possible.

The third has an “adventurous” life style (he loves to hang glide and fly in small
airplanes); he lives in Pittsburgh. To make this person’s trip enjoyable, find a flight from
Pittsburgh to Baltimore on a plane that can hold the smallest number of passengers.

You will also need to specify the date of arrival for each of these family members.

A2: You live in Philadelphia. You need to make a business trip to San Francisco next
week. You have an old friend in Dallas and you’d therefore like to spend the afternoon
in Dallas on your way out to San Francisco. You’d prefer to fly first class on American.
Find out what kind of aircraft you’ll be flying on.

A3: Choose two cities: city A ____ and city B ____

You live in city A. You want to combine a business trip to city B with pleasure by
taking your spouse along, but you don’t want to have to pay for your spouse’s travel.
Fortunately, the company will reimburse you for an amount equal to the price of a
first-class ticket. Plan a trip that will allow you to stretch the travel allowance to cover the
expenses for both you and your spouse.

A4: Choose three cities: A ____ B ____ and C ____ You
live in city A. You have only three days for job hunting and have arranged two job
interviews in cities B and C, each lasting at least three hours. Pick the cities and plan the
flight itinerary.

A5: Choose city A ____, your starting point, two other destination cities (city B ____,
city C ____), and then solve the following scenario.

You have only three days for job hunting, and you have arranged job interviews in two
different cities. (The interview times will depend on your flight schedule.) Start from city
A and plan the flight and ground transportation from city B and city C, and back home to
city A.

Scenario  A5  was  the  common  scenario  specified  by  BBN  and  SRI. 


11.3.4  Data  Base  Exploration  Scenarios 


These  scenarios  also  involve  longer  and  more  involved  interactions  with  the  system.  Unlike 
the  previous  scenarios,  these  encourage  the  user  to  treat  the  ATIS  system  as  a  data  base  for 
free  exploration.  We  included  these  scenarios  to  produce  natural  language  different  from 
that  used  for  trip  planning. 

B5: You have heard in a commercial that one airline claims to have more flights with
some particular class (you can’t remember whether it is business class, or first class, or
some other class) than all other airlines combined. You also forgot which airline made that
claim. Find out which airline, if any, can reasonably make such a claim, and which class
the claim was about.

B6: An airline has a “hub” in a city if it has a lot more flights into and out of that city
than other airlines do. Determine whether any of the airlines represented in the database
have hubs, and, if so, which cities are hubs for which airlines. (You may stop with one
airline and one hub, if you wish.)

B7: Pick an airline and pretend that you are in charge of suggesting new routes and
schedule changes for that airline. Investigate the database to determine whether there is
a gap that your airline could fill. A gap could be a route between two cities that you
don’t currently serve, or it could be a part of the day that isn’t covered by you (or your
competition) for travel between cities you already serve.

B8: Two airlines might be suspected of being in collusion if they appear to have divided
up territory (or time of day in a single city) so as not to compete with one another, or if
they have nearly identical prices for the same service. Pick any two airlines, and determine
whether they might be colluding.

B9: Choose a city ____ and a period of time during the day ____

Suppose that the airport in this city has closed for that period of time due to minor
construction work. Of the airlines that use this airport, find the airline that will be least affected
by the airport closing. Also, find out what is the maximum possible number of passengers
that might be affected (based on the seating capacity of the types of aircraft used in those
flights).

B10: Choose an airport: ____ The “congestion period” in an airport is that
3-hour period in which there are more take-offs and landings than in any other 3-hour
period of the day. For the airport you have chosen, determine its congestion period.

B11: Choose one city from each of the following pairs of cities: Boston/Washington DC;
Denver/San Francisco. Find an airline which provides only direct flights between the two
chosen cities. Also, is there an airline that provides only connecting flights between the
same two cities? If so, determine the types of aircraft used on those flights.

B12: Choose one of the following cities: Atlanta, Denver, Dallas. Suppose that due to
bad weather the airport in this city will be closed for the morning. Find all the flights that
make a connection in this airport and that will be affected. Find other similar (same origin
and final destination) afternoon flights that connect through the same airport.

B13: Choose two cities A ____ and B ____ Choose any three airlines
that fly from city A to city B.

You have always been curious as to the price differences among the various airlines.
Perform some research on price comparisons for the three selected airlines, studying flights
from city A to city B with similar characteristics (same class of service, similar departure
time, etc.). In your opinion, which airline is the most “greedy” and which is the least
“greedy” among the three airlines?

B14: Choose an airport: ____ An airline dominates a time slot in an airport if
that airline has more takeoffs and landings in that airport during that time slot than any
other airline. A business time slot is early morning or evening. Does any airline dominate
the business time slots?

B15: An airline is “dependent” on a manufacturer if most of the flights on that airline use
planes from that manufacturer. Are any airlines “dependent” on a manufacturer?

B16: Which types of aircraft are used for transcontinental (east coast to west coast) flights?
Are any of these types of aircraft used for flights which are not legs of transcontinental
flights?

B17: An airline cuts cost on a route if it serves only a snack to coach passengers when it
serves a full meal to first class passengers. Pick two airlines and find out if there is a route
on which they cut cost in that manner.


11.4  BBN  Subject  Instructions 

ATIS  Data  Collection 
Instructions  to  Subjects 

Thank you for participating in our data collection project.

1.  Getting  Acquainted 

What  is  ATIS? 


ATIS, or Airline Travel Information System, is a voice-operated system under development
at BBN. The system is designed to give answers to spoken queries about flight information.
The flight information that the system knows about is a subset of the information contained
in the Official Airline Guide (OAG). (See Section 2 entitled “What kind of information is
in the ATIS system”.)


What  is  the  purpose  of  this  data  collection  effort? 


Because the system we have is still under development, it will be able to answer many (we
hope) but not all of your queries. The purpose of this data collection is to give us a larger
sample of queries from actual users who are trying to use the system. We plan to use the
collected data to help us improve the performance of the system.


What am I supposed to do?


Your task is to use the ATIS system to help you in solving several flight-related scenarios
given to you by the data collector at the time of data collection. In the process of solving
the scenario, don’t make a special effort to restrict yourself to asking queries that are the most
relevant to that scenario. In particular, when you are just starting a scenario you may want
to explore the information that ATIS contains.

How  to  use  the  system 


See Section 3 for a complete set of instructions on how to use the system. In addition,
you will receive instructions on the use of the system from the data collector. The data
collector will be monitoring the whole data collection session.

Briefly, you request information from ATIS by using the microphone to talk to the system.
The screen will show what the system thinks you said and what the answer to your query
is.

Remember: The purpose of this process is to collect speech data; the more queries you
ask the better. Don’t try to think of the shortest way you can get your information. Don’t
be afraid to explore.


2.  What  Kind  of  Information  is  in  the  ATIS  System? 


ATIS  knows  about  flights  between  a  limited  number  of  cities  (see  list  below).  For  each 
flight,  ATIS  knows  about  the  following: 


Airline name (American, Continental, Delta, TWA, Eastern, Lufthansa, Midway,
United, and USAir)

Airline  abbreviation  (AA,  CO,  DL,  TW,  EA,  LH,  ML,  UA  and  US). 

Flight  number 
Originating  city 
Destination  city 

Departure  and  arrival  times  (also  elapsed  time  of  a  flight) 

Dates  (July  25,  7th  of  November,  etc.) 

Days  of  service  (Sunday,  Monday,  etc.) 

Classes of service (first class, business, etc.)

Number of stops

Fares (these vary depending on flight, day, class of service, and whether the
fare is one way or round trip)

Meals 

Ground  transportation  between  airport  and  downtown,  and  its  cost 
Type of aircraft (jet, turboprop, DC10, 72S, 727, etc.)

Seating  capacity  of  aircraft 


You may ask the system questions about any abbreviations you don’t understand.


Cities  and  Airports  that  ATIS  knows  about: 


CITY/AIRPORT            AIRPORT CODE   ALTERNATE AIRPORT NAME

Atlanta                 ATL            William B. Hartsfield
Baltimore/Washington    BWI
Boston                  BOS            Logan
Dallas/Fort Worth       DFW
Denver                  DEN            Stapleton
Oakland                 OAK
Philadelphia            PHL
Pittsburgh              PIT
San Francisco           SFO

There are 11 cities served by 9 airports. Baltimore and Washington are served by the same
airport; likewise for Dallas and Fort Worth. The airports in Oakland and San Francisco
serve both cities.


3.  How  to  Use  the  ATIS  System 


3.1  Scenarios 


After an initial training period, the data collector will give you several scenarios to
solve using the ATIS system. Below is the procedure to follow for each of the scenarios:

• Begin Scenario: To begin a new scenario, use the mouse to click (with the left-most
mouse button) the Begin Scenario button in the top-left of the ATIS screen (see Figure 1).
A menu of scenario numbers will pop up. Click the identification number of the scenario
that you will be working on. (The Begin Scenario button will now say End Scenario; see
below.)

Note: Make sure you write down the specifics of the problem you have selected to
solve. This may avoid getting you and the system confused! Use the notepad provided to
you by the data collection manager to write down notes for yourself as you wish throughout
the data collection session.

• Query Sequence: You solve the scenario by asking the ATIS system a sequence of
queries. To ask a new query:


• Click the New Query button in the top-left corner of the ATIS screen. The System
Status display will show the word “listening” when the system is ready to accept
speech input.

•  Speak  your  query  into  the  microphone. 

•  After  a  short  wait,  the  system  will  show  on  a  new  card  what  it  thought  you  said  (see 
Figure  1). 

•  After  another  (usually  longer)  wait,  the  system  displays  on  the  card  an  answer  to 
your query. (Sometimes, this wait can be quite long; please be patient.)


Repeat the above sequence for every new query until you feel you have finished working
with that scenario. Solving a scenario means that you feel you have obtained from ATIS
the flight information specified in the scenario.

If after saying a query you decide that you wish to abort that query (e.g., you made a
false start or you changed your mind about what you wanted to say), simply click on the
Abort Query button.

• End Scenario: To end a scenario, click the End Scenario button on the ATIS screen.
Now, you can begin another scenario by following the above procedure.


3.2 ATIS Screen


Figure 1 shows the basic ATIS screen, which you will use to control the session and
the system uses to communicate with you. The ATIS screen has two main displays: the
top-level menu and the cards display.

[Figure 1. ATIS Screen: the top-level menu, containing the Begin Scenario, New Query,
Restack, and Abort Query buttons and the System Status window (shown as “Ready”)]

Begin Scenario - Pop the scenarios menu
New Query - Start recording a new query
Restack - Return all cards to their original position
Abort Query - Abort the query
Clone Button - Clone (copy) a card (click with left mouse button and
drag to desired spot on the screen)
Push/Pop Button - Pop a hidden card to the top of the card stack; push a card back
to its original position.

• Top Level Menu: This menu is shown in the top-left part of the ATIS screen. It contains
a number of buttons and status windows. We have already described how several of the
buttons and status windows are used.

• Cards Display: The cards display contains your queries and their answers. Each card
corresponds to a single query. The top line of each card shows the card number and what
the system thought you said. In the main body of the card, the system displays the answer
to the query. As you say the different queries, the cards appear automatically and stack
themselves in the lower right part of the screen. The stacking is done such that the old
queries are visible but not their answers; only the most recent card is fully visible. After
about ten cards have been created, the oldest one will be deleted automatically whenever
a new one is created.

To  the  left  of  each  card  is  a  scroll  bar  that  allows  you  to  scroll  that  card  vertically  in 
case  the  answer  occupies  more  space  than  allotted. 

You will often find the need to look back at the answers to some of the previous queries.
The answers to older queries can be seen by simply clicking on the push/pop button in
the top-left corner of the card of interest. Clicking that button will bring forward (pop)
that card and you will be able to see the whole card. Clicking the same button again will
push that card back where it came from. You can pop a number of cards in that manner.
To avoid having to push them all back by clicking the push/pop button, simply click on
the Restack button in the top-level menu.

Note:  The  system  only  remembers  the  ten  most  recent  queries.  Cards  that  disappear  from 
the  screen  can  never  be  retrieved. 


3.3 Clone Cards


In addition to being able to look back at previous queries and their answers, it is often
convenient to be able to make a copy of a card for later reference. This is particularly
important if you suspect that the card of interest corresponds to an old query that soon may
disappear from the screen.

To make a copy of a card, you clone it. To do that, click the Clone button in the
top-left part of the card and drag the card-skeleton that appears to the desired spot on the
screen, then click the mouse again. Figure 2 shows a card that has been cloned. A clone
card looks the same as a card in the Cards Display except that it has an additional top bar
and different buttons. Unlike the cards in the Cards Display, the clone card can be resized
using the Resize button in the top-right corner of the card.

[Figure 2. Clone Card: a clone of card 002, “What is the name of the airport in Boston?”,
with the answer AIRPORT NAME: Logan International, shown next to the Cards Display
stack holding cards 000, 001, and 002]

Resize Button - Resize card (e.g., make it wider to see more of the answer).
Iconize Clone Button - Iconize a clone card; to reopen the icon click on the
appropriate icon (e.g., card 1)
Delete Clone Button - Delete clone card
Scroll Bar - Scroll card up/down (e.g., to see other parts of the answer)

The clone card can be deleted from the screen by clicking the Delete Clone button in
the top-left corner of the card. A clone of the same card can be made again from the Cards
Display, but only if that card is still visible in the Cards Display. To make the clone card
disappear temporarily, click on the Iconize Clone button in the left part of the query line
of the clone card. The clone card will disappear but an icon corresponding to that card
will appear in the top right part of the screen with the name of the card in it. The clone
card can be made visible again by clicking on the icon for that card.

You may clone as many cards as you like, within the limits of the size of the screen.


3.4  Messages  from  the  system 


If the system is unable to understand or answer your request, you will receive a brief
message on the screen. Read the message and click in the box under it that says “OK”
when you are ready to continue.


4.  Tips  for  Success 


Remember, this is an experimental system. It can be quite slow at times and it may
not be able to answer all your queries. Please be patient. Here are some tips on how to
make the session go smoothly.

Be  specific.  The  system  tends  to  take  what  you  say  quite  literally. 

Don’t be long-winded. Many short queries generally work better than a few long ones.

Don’t pause. Short pauses are OK, but if you pause too long in the middle of a question,
the system will think that you have finished speaking, and will stop listening to you.

Think before you talk. Compose your query, press the New Query button on the screen,
and begin talking soon after the system status changes from “ready” to “listening”.

Be natural. Don’t try to speak “computerese”; just ask the questions in ordinary English,
using your natural voice.

Refer to notes. You might want to keep this document turned to the page titled “What
kind of information is in the ATIS System?” to see the kinds of things you can ask about.

Ask it! If you’re not sure whether to ask a particular question or not, go ahead and give
it a try!

Don’t get overwhelmed when receiving large amounts of information from ATIS. When
ATIS responds with an answer that contains too much information, you may want to ask
additional queries that will constrain that information.


11.5  A  Sample  BBN  Session 


As  an  example  of  the  types  of  interactions  users  had  with  our  ATIS  system,  we  include 
here  the  record  of  a  user’s  interaction  with  the  system  to  solve  a  particular  scenario.  We 
include the natural language queries alone in Section 11.5.2.


11.5.1  The  Session 


[RDBVersion:] rdb3

[Timestamp: BBN Started log at 16:01:05]

[Speaker:] e3
[Date:] 050691
[ScenarioType:] BBNA4

[Begin ScenarioDescr:]

A4: You live in city A. You want to combine a business trip
to city B with pleasure by taking your spouse along. You will
receive a fixed travel allowance, based on a first-class
ticket. Plan a trip that will allow you to stretch the travel
allowance to cover the expenses for both you and your spouse.

[End ScenarioDescr]

[UtteranceID:] 001

[Timestamp: Sent speech for utterance 001 at 16:01:52]
[Timestamp: BBN sent transcription for utterance 001 at 16:02:05]

[Begin Utterance: 001]

WHICH AIRLINES FLY FROM BOSTON TO SAN FRANCISCO
[End Utterance: 001]

[Timestamp: Sent answer for utterance 001 at 16:02:28]
[Begin Result: 001]

AIRLINE  AIRLINE NAME            AIRLINE + FLIGHT #  FROM  TO   DEPT TIME  ARRV TIME
US       USAIR                   US81/US37           BOS   SFO  12:10      16:52
UA       UNITED AIRLINES         UA343               BOS   SFO  08:40      13:36
UA       UNITED AIRLINES         UA343/UA551         BOS   OAK  08:40      13:39
DL       DELTA AIR LINES, INC.   DL975/DL99          BOS   SFO  15:20      21:15
UA       UNITED AIRLINES         UA201/UA343         BOS   SFO  08:38      13:36
UA       UNITED AIRLINES         UA281/UA673         BOS   OAK  17:20      22:05
DL       DELTA AIR LINES, INC.   DL831/DL149         BOS   SFO  08:05      13:25
UA       UNITED AIRLINES         UA281/UA297         BOS   SFO  17:20      22:14
DL       DELTA AIR LINES, INC.   DL487/DL395         BOS   SFO  18:45      23:50
UA       UNITED AIRLINES         UA201/UA551         BOS   OAK  08:38      13:39
UA       UNITED AIRLINES         UA93                BOS   SFO  17:45      21:27
TW       TRANS WORLD AIRLINE...  TW61                BOS   SFO  18:10      21:37
CO       CONTINENTAL AIRLINES    CO1765              BOS   SFO  17:40      23:10
AA       AMERICAN AIRLINES       AA813               BOS   SFO  11:49      17:58
DL       DELTA AIR LINES, INC.   DL169/DL887         BOS   SFO  11:32      16:50
UA       UNITED AIRLINES         UA21                BOS   SFO  08:00      11:31

[End Result: 001]

[UtteranceID:] 002

[Timestamp: Sent speech for utterance 002 at 16:02:53]
[Timestamp: BBN sent transcription for utterance 002 at 16:03:08]

[Begin Utterance: 002]

WHICH OF THOSE FLIGHTS ARE FIRST CLASS SERVICE
[End Utterance: 002]

[Timestamp: Sent answer for utterance 002 at 16:03:58]

[Begin Result: 002]

AIRLINE + FLIGHT #  FROM  TO   DEPT TIME  ARRV TIME  A/C CODES  MEALS  FLIGHT DAYS  FARE CODE
UA21                BOS   SFO  08:00      11:31      D10        B      DAILY
UA343/UA551         BOS   OAK  08:40      13:39      D8S/727    BL     SU
DL975/DL99          BOS   SFO  15:20      21:15      757/767    SD     DAILY
UA93                BOS   SFO  17:45      21:27      D8S        D      DAILY        F
DL831/DL149         BOS   SFO  08:05      13:25      757/767    BL     DAILY        F
UA201/UA551         BOS   OAK  08:38      13:39      D8S/727    BL     NOT SU       F
CO1765              BOS   SFO  17:40      23:10      M80        D      NOT SA       F
UA281/UA297         BOS   SFO  17:20      22:14      D8S/72S    D      DAILY        F
UA343               BOS   SFO  08:40      13:36      D8S        BL     SU           F
DL487/DL395         BOS   SFO  18:45      23:50      72S/757    DS     DAILY        F
UA281/UA673         BOS   OAK  17:20      22:05      D8S/733    D      DAILY        F
TW61                BOS   SFO  18:10      21:37      L10        D      DAILY        F
US81/US37           BOS   SFO  12:10      16:52      M80/733    SL     DAILY        F
UA201/UA343         BOS   SFO  08:38      13:36      D8S/D8S    BL     NOT SU       F
AA813               BOS   SFO  11:49      17:58      767        XO     DAILY        F
DL169/DL887         BOS   SFO  11:32      16:50      757/757    LS     DAILY        F

[End Result: 002]

[UtteranceID:] 003

[Timestamp: Sent speech for utterance 003 at 16:05:22]
[Timestamp: BBN sent transcription for utterance 003 at 16:06:07]

[Begin Utterance: 003]

LIST THOSE FLIGHTS LEAVING BEFORE 9 AM DAILY
[End Utterance: 003]

[Timestamp: Sent answer for utterance 003 at 16:07:40]

[Begin Result: 003]

AIRLINE + FLIGHT #  FROM  TO   DEPT TIME  ARRV TIME  A/C CODES  MEALS  FLIGHT DAYS
UA21                BOS   SFO  08:00      11:31      D10        B      DAILY
UA343/UA551         BOS   OAK  08:40      13:39      D8S/727    BL     SU
DL831/DL149         BOS   SFO  08:05      13:25      757/767    BL     DAILY
UA201/UA551         BOS   OAK  08:38      13:39      D8S/727    BL     NOT SU
UA343               BOS   SFO  08:40      13:36      D8S        BL     SU
UA201/UA343         BOS   SFO  08:38      13:36      D8S/D8S    BL     NOT SU

[End Result: 003]

[UtteranceID:] 004

[Timestamp: Sent speech for utterance 004 at 16:08:26]
[Timestamp: BBN sent transcription for utterance 004 at 16:08:41]

[Begin Utterance: 004]

WHICH OF THOSE FLIGHTS ARE NONSTOP
[End Utterance: 004]

[Timestamp: Sent answer for utterance 004 at 16:09:13]
[Begin Result: 004]

AIRLINE + FLIGHT #  FROM  TO   DEPT TIME  ARRV TIME  A/C CODES  MEALS  FLIGHT DAYS
UA21                BOS   SFO  08:00      11:31      D10        B      DAILY

[End Result: 004]

[UtteranceID:] 005

[Timestamp: Sent speech for utterance 005 at 16:10:12]

[Timestamp: BBN sent transcription for utterance 005 at 16:10:28]

[Begin Utterance: 005]

LIST ALL FLIGHTS ON UNITED FROM SAN FRANCISCO TO BOSTON
[End Utterance: 005]

[Timestamp: Sent answer for utterance 005 at 16:10:36]

[Begin Result: 005]

AIRLINE + FLIGHT #  FROM  TO   DEPT TIME  ARRV TIME  A/C CODES  MEALS  FLIGHT DAYS
UA20                SFO   BOS  13:40      22:02      D10        L      DAILY
UA92                SFO   BOS  08:00      16:27      D8S        B      DAILY
UA194/UA352         SFO   BOS  06:20      16:19      D10/D8S    BL     DAILY
UA982/UA352         OAK   BOS  06:25      16:19      733/D8S    BL     DAILY
UA354               OAK   BOS  11:05      20:56      72S        LD     DAILY
UA820/UA354         SFO   BOS  11:08      20:56      D10/72S    LD     DAILY

[End Result: 005]

[UtteranceID:] 006

[Timestamp: Sent speech for utterance 006 at 16:11:34]
[Timestamp: BBN sent transcription for utterance 006 at 16:11:59]

[Begin Utterance: 006]

WHAT IS THE ROUND-TRIP FIRST CLASS FARE ON UNITED FROM BOSTON TO
SAN FRANCISCO
[End Utterance: 006]

[Timestamp: Sent answer for utterance 006 at 16:12:49]

[Begin Result: 006]

FARE ID  ONE-WAY COST  ROUND TRIP  FARE CODE  RESTRICTION
7100570  $836.00       $1,672.00   F
7100132  $860.00       $1,720.00   F

[End Result: 006]

[UtteranceID:] 007

[Timestamp: Sent speech for utterance 007 at 16:13:42]

[Timestamp: BBN sent transcription for utterance 007 at 16:14:01]

[Begin Utterance: 007]

WHAT IS THE ROUND-TRIP THRIFT FARE ON U S AIR FROM BOSTON TO
SAN FRANCISCO
[End Utterance: 007]

[Timestamp: Sent answer for utterance 007 at 16:16:24]

[Begin Result: 007]

FARE ID  ONE-WAY COST  ROUND TRIP  FARE CODE  RESTRICTION
7100137  $573.00       $1,146.00   Y

[End Result: 007]

[UtteranceID:] 008

[Timestamp: Sent speech for utterance 008 at 16:17:00]
[Timestamp: BBN sent transcription for utterance 008 at 16:17:13]
[Begin Utterance: 008]

ON UNITED
[End Utterance: 008]

[Timestamp: Sent answer for utterance 008 at 16:18:27]
[Begin Result: 008]

FARE ID  ONE-WAY COST  ROUND TRIP  FARE CODE  RESTRICTION
7100142  $189.00       $378.00     QX         AP/80
7100567  $189.00       $378.00     QX         AP/80

[End Result: 008]

[UtteranceID:] 009

[Timestamp: Sent speech for utterance 009 at 16:19:08]
[Timestamp: BBN sent transcription for utterance 009 at 16:19:22]
[Begin Utterance: 009]

ON DELTA
[End Utterance: 009]

[Timestamp: Sent answer for utterance 009 at 16:19:47]

[Begin Result: 009]

FARE ID  ONE-WAY COST  ROUND TRIP  FARE CODE  RESTRICTION
7100137  $573.00       $1,146.00   Y

[End Result: 009]

[UtteranceID:] 00A

[Timestamp: Sent speech for utterance 00A at 16:20:04]
[Timestamp: BBN sent transcription for utterance 00A at 16:20:11]
[Begin Utterance: 00A]

ON TWA
[End Utterance: 00A]

[Timestamp: Sent answer for utterance 00A at 16:20:42]

[Begin Result: 00A]

FARE ID  ONE-WAY COST  ROUND TRIP  FARE CODE  RESTRICTION
7100142  $189.00       $378.00     QX         AP/80

[End Result: 00A]

[UtteranceID:] 00B

[Timestamp: Sent speech for utterance 00B at 16:21:00]
[Timestamp: BBN sent transcription for utterance 00B at 16:21:09]
[Begin Utterance: 00B]

ON AMERICAN AIRLINES
[End Utterance: 00B]

[Timestamp: Sent answer for utterance 00B at 16:21:48]

[Begin Result: 00B]

FARE ID  ONE-WAY COST  ROUND TRIP  FARE CODE  RESTRICTION
7100138  $508.00       $1,016.00   B

[End Result: 00B]

[UtteranceID:] 00C

[Timestamp: Sent speech for utterance 00C at 16:22:03]
[Timestamp: BBN sent transcription for utterance 00C at 16:22:17]
[Begin Utterance: 00C]

EXPLAIN RESTRICTION AP80
[End Utterance: 00C]

[Timestamp: Sent answer for utterance 00C at 16:22:35]

[Begin Result: 00C]

RESTRICTION  APPLICATION
AP/80        FARES APPLY TO/FROM INTERMEDIATE POINTS

[End Result: 00C]

[UtteranceID:] 00D

[Timestamp: Sent speech for utterance 00D at 16:23:26]

[Timestamp: BBN sent transcription for utterance 00D at 16:23:43]
[Begin Utterance: 00D]

LIST ALL UNITED FLIGHTS FROM BOSTON TO SAN FRANCISCO
[End Utterance: 00D]

[Timestamp: Sent answer for utterance 00D at 16:23:50]

[Begin Result: 00D]

AIRLINE + FLIGHT #  FROM  TO   DEPT TIME  ARRV TIME  A/C CODES  MEALS  FLIGHT DAYS
UA21                BOS   SFO  08:00      11:31      D10        B      DAILY
UA343/UA551         BOS   OAK  08:40      13:39      D8S/727    BL     SU
UA93                BOS   SFO  17:45      21:27      D8S        D      DAILY
UA201/UA551         BOS   OAK  08:38      13:39      D8S/727    BL     NOT SU
UA281/UA297         BOS   SFO  17:20      22:14      D8S/72S    D      DAILY
UA343               BOS   SFO  08:40      13:36      D8S        BL     SU
UA281/UA673         BOS   OAK  17:20      22:05      D8S/733    D      DAILY
UA201/UA343         BOS   SFO  08:38      13:36      D8S/D8S    BL     NOT SU

[End Result: 00D]

[UtteranceID:] 00E

[Timestamp: Sent speech for utterance 00E at 16:25:01]
[Timestamp: BBN sent transcription for utterance 00E at 16:25:21]
[Begin Utterance: 00E]

EXPLAIN FARE ID
[End Utterance: 00E]

[Timestamp: BBN Sent query failure for utterance 00E at 16:26:41]
[Begin Error: 00E]

Sorry, the system was unable to answer your question.

[End Error: 00E]

[UtteranceID:] 00F

[Timestamp: Sent speech for utterance 00F at 16:27:29]
[Timestamp: BBN sent transcription for utterance 00F at 16:27:49]
[Begin Utterance: 00F]

EXPLAIN A/C CODES
[End Utterance: 00F]

[Timestamp: Sent answer for utterance 00F at 16:28:20]

[Begin Result: 00F]

PROPULSION  AIRCRAFT  AIRCRAFT DESCRIPTION
TURBOPROP   DH8       BOEING CANADA DHC...
JET         727       BOEING 727 PASSEN...
JET         72S       BOEING 727-200 (A...
JET         73S       BOEING 737 PASSEN...
JET         733       BOEING 737-300
JET         734       BOEING 737-400
JET         74M       BOEING 747 MIXED ...
JET         757       BOEING 757-200 PA...
JET         767       BOEING 767 (ALL S...
JET         763       BOEING 767-300/300ER
TURBOPROP   J31       BRITISH AEROSPACE...
TURBOPROP   SWM       FAIRCHILD (SWEARI...
JET         F28       FOKKER F28 FELLOW...
JET         L10       LOCKHEED L1011 (A...
JET         L15       LOCKHEED L1011-50...
JET         D8S       MCDONNELL DOUGLAS...
JET         D9S       MCDONNELL DOUGLAS...
JET         D10       MCDONNELL DOUGLAS...
JET         M80       MCDONNELL DOUGLAS...
TURBOPROP   SH3       SHORTS 330 PASSENGER
TURBOPROP   SH6       SHORTS 360
JET         100       FOKKER 100

[End Result: 00F]

[UtteranceID:] 00G

[Timestamp: Sent speech for utterance 00G at 16:29:03]

[Timestamp: BBN sent transcription for utterance 00G at 16:29:35]
[Begin Utterance: 00G]

LIST ALL UNITED FLIGHTS FROM BOSTON TO SAN FRANCISCO
WITH FARE CODE QX
[End Utterance: 00G]

[Timestamp: Sent answer for utterance 00G at 16:29:58]

[Begin Result: 00G]

AIRLINE + FLIGHT #  FROM  TO   DEPT TIME  ARRV TIME  A/C CODES  MEALS  FLIGHT DAYS
UA21                BOS   SFO  08:00      11:31      D10        B      DAILY
UA93                BOS   SFO  17:45      21:27      D8S        D      DAILY
UA201/UA551         BOS   OAK  08:38      13:39      D8S/727    BL     NOT SU
UA281/UA297         BOS   SFO  17:20      22:14      D8S/72S    D      DAILY
UA281/UA673         BOS   OAK  17:20      22:05      D8S/733    D      DAILY
UA201/UA343         BOS   SFO  08:38      13:36      D8S/D8S    BL     NOT SU

[End Result: 00G]

[UtteranceID:] 00H

[Timestamp: Sent speech for utterance 00H at 16:31:11]

[Timestamp: BBN sent transcription for utterance 00H at 16:31:48]

[Begin Utterance: 00H]

LIST ALL FLIGHTS ON UNITED FROM SAN FRANCISCO TO BOSTON WITH
FARE CODE QX
[End Utterance: 00H]

[Timestamp: Sent answer for utterance 00H at 16:32:00]

[Begin Result: 00H]

AIRLINE + FLIGHT #  FROM  TO   DEPT TIME  ARRV TIME  A/C CODES  MEALS  FLIGHT DAYS
UA20                SFO   BOS  13:40      22:02      D10        L      DAILY
UA92                SFO   BOS  08:00      16:27      D8S        B      DAILY
UA194/UA352         SFO   BOS  06:20      16:19      D10/D8S    BL     DAILY
UA982/UA352         OAK   BOS  06:25      16:19      733/D8S    BL     DAILY
UA354               OAK   BOS  11:05      20:56      72S        LD     DAILY

[End Result: 00H]

[UtteranceID:] 00I

[Timestamp: Sent speech for utterance 00I at 16:33:14]

[Timestamp: BBN sent transcription for utterance 00I at 16:33:29]
[Begin Utterance: 00I]

LIST THE NUMBER OF STOPS FOR EACH OF THOSE FLIGHTS
[End Utterance: 00I]

[Timestamp: Sent answer for utterance 00I at 16:33:45]
[Begin Result: 00I]

STOPS  AIRLINE + FLIGHT #  FROM  TO   DEPT TIME  ARRV TIME  A/C CODES  MEALS  FLIGHT DAYS
1      UA194/UA352         SFO   BOS  06:20      16:19      D10/D8S    BL     DAILY
0      UA92                SFO   BOS  08:00      16:27      D8S        B      DAILY
1      UA982/UA352         OAK   BOS  06:25      16:19      733/D8S    BL     DAILY
1      UA354               OAK   BOS  11:05      20:56      72S        LD     DAILY
0      UA20                SFO   BOS  13:40      22:02      D10        L      DAILY

[End Result: 00I]

[Timestamp: BBN End log at 16:35:48]
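The timestamps in a log like this one make it possible to measure how long the user waited for each answer. The following is a minimal sketch, not part of the original logging software: it assumes timestamp lines of the form shown in the sample list, and the function name `latencies` and the regular expression are illustrative choices. It also assumes a session does not cross midnight.

```python
import re

# Matches lines such as "[Timestamp: Sent speech for utterance 003 at 16:05:22]".
# The exact line format is an assumption based on the session log above.
LINE = re.compile(
    r"\[Timestamp: (?:BBN )?[Ss]ent (speech|transcription|answer) "
    r"for utterance (\w+) at (\d{2}):(\d{2}):(\d{2})\]"
)


def latencies(log_lines):
    """Seconds from 'Sent speech' to 'Sent answer', keyed by utterance ID."""
    sent, result = {}, {}
    for line in log_lines:
        match = LINE.search(line)
        if not match:
            continue  # not a timestamp line we understand
        event, utt, hh, mm, ss = match.groups()
        seconds = int(hh) * 3600 + int(mm) * 60 + int(ss)
        if event == "speech":
            sent[utt] = seconds
        elif event == "answer" and utt in sent:
            result[utt] = seconds - sent[utt]
    return result


sample = [
    "[Timestamp: Sent speech for utterance 003 at 16:05:22]",
    "[Begin Utterance: 003]",
    "[Timestamp: Sent answer for utterance 003 at 16:07:40]",
]
print(latencies(sample))  # {'003': 138}
```

For utterance 003 this gives 138 seconds between sending the speech and receiving the answer.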


11.5.2  The  Utterances 


Here are the utterances from the preceding sample session, in lexicalized SNOR format.
Each utterance is followed by its utterance ID in the ATIS2 corpus.


WHICH AIRLINES FLY FROM BOSTON TO SAN FRANCISCO (e30011sx)

WHICH OF THOSE FLIGHTS HAVE FIRST CLASS SERVICE (e30021sx)

LIST THOSE FLIGHTS LEAVING BEFORE NINE A M DAILY (e30031sx)

WHICH OF THOSE FLIGHTS ARE NONSTOP (e30041sx)

LIST ALL FLIGHTS ON UNITED FROM SAN FRANCISCO TO BOSTON (e30051sx)

WHAT IS THE ROUND TRIP FIRST CLASS FARE ON UNITED FROM BOSTON
TO SAN FRANCISCO (e30061sx)

WHAT IS THE ROUND TRIP THRIFT FARE ON U S AIR FROM BOSTON
TO SAN FRANCISCO (e30071sx)

ON UNITED (e30081sx)

ON DELTA (e30091sx)

ON T W A (e300a1sx)

ON AMERICAN AIRLINES (e300b1sx)

EXPLAIN RESTRICTION A P EIGHTY (e300c1sx)

LIST ALL UNITED FLIGHTS FROM BOSTON TO SAN FRANCISCO (e300d1sx)

EXPLAIN FARE I D (e300e1sx)

EXPLAIN A C CODES (e300f1sx)

LIST ALL UNITED FLIGHTS FROM BOSTON TO SAN FRANCISCO
WITH FARE CODE Q X (e300g1sx)

LIST ALL FLIGHTS ON UNITED FROM SAN FRANCISCO TO BOSTON
WITH FARE CODE Q X (e300h1sx)

LIST THE NUMBER OF STOPS FOR EACH OF THOSE FLIGHTS (e300i1sx)


11.6  Final  Summary  of  The  HARC  System 


We conclude this chapter with the descriptions of DELPHI and BYBLOS and of the overall
HARC system that were submitted to NIST as part of our participation in the February
1992 Cross-Site Evaluations of speech recognition, natural language, and spoken language
systems. We hope this will give a good overview of the state of our systems at the end of
this project.


11.6.1  DELPHI 


1) OVERALL SYSTEM DESCRIPTION: The BBN DELPHI natural language processing
system uses a grammar based on the definite clause grammar formalism, extended with type
declarations, which restrict the range of the possible values of arguments, and a limited form
of disjunction. It is parsed using a variant of the Graham-Harrison-Ruzzo algorithm, which
has been modified to use a trainable agenda to produce a best-first parse, rather than all
possible parses. While “unification grammars” normally build up semantic representations
in tandem with syntactic structure, our system performs only semantic type consistency
checking on the fly, and attempts to form WFFs only when all the necessary semantic
elements have been found (typically, at the clausal level). A fragment processor produces
a semantic representation in those cases in which the parser cannot return a parse. The
discourse component of DELPHI uses an extension of Bonnie Webber’s discourse entities
model to determine antecedents of pronouns and other anaphoric expressions. A domain-specific
plan tracker is used to keep track of the state of the discourse and to record
contextual information that may be in effect for more than one query (e.g., in the case of
ATIS: origin, destination, flight day, etc.). The semantic representation is in a version of
MRL, a Meaning Representation Language first used in LUNAR. This is translated into
ERL for database retrievals.
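The idea of contextual information that stays in effect across queries can be illustrated with a small sketch. This is not DELPHI's plan tracker: the class name `PlanTracker`, the slot names, and the simple "new values override old ones" rule are all assumptions made for illustration. It shows how an elliptical follow-up such as "ON UNITED" could inherit the origin, destination, and fare type of the preceding fare query.

```python
from dataclasses import dataclass, field


@dataclass
class PlanTracker:
    """Hypothetical context store; slot names are illustrative only."""

    context: dict = field(default_factory=dict)

    def interpret(self, query_slots: dict) -> dict:
        # Slots given in the new query override remembered ones;
        # everything else is inherited from earlier queries.
        self.context.update(query_slots)
        return dict(self.context)


tracker = PlanTracker()
first = tracker.interpret(
    {"origin": "BOS", "destination": "SFO", "airline": "US", "fare": "thrift"}
)
follow_up = tracker.interpret({"airline": "UA"})  # elliptical "ON UNITED"
print(follow_up)
```

A real tracker would also need rules for when context should be dropped, which this sketch deliberately leaves out.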

2) TRAINING DATA: For the February 1992 evaluation test, the main training data used
were the NL utterances and annotations in the third through eighth shipments of annotated
data for atis2 (answer.910830 through answer.911030; 1933 annotated utterances and 2555
total utterances). At first, we used the ninth through thirteenth shipments of annotated
data (answer.911108 through answer.911217) as development test. Eventually, we added
all but answer.911217 to a second training set (1405 annotated utterances and 1746 total
utterances), retaining answer.911217 (365 annotated utterances and 455 total utterances)
and adding the October 1991 dry run material (288 annotated utterances and 361 total
utterances) as development test. Finally, these two small corpora were added to the training
as well. The grammar and lexicon had previously been trained on the class A sentences
from the atis0 data release, the class A and D1 utterances from the atis2 release collected
by TI, and the first two shipments of annotated cross-site data for atis2. The later atis2
data releases were used as incremental additions to this.

3) LEXICON DESCRIPTION: The DELPHI system for ATIS uses multiple lexicons: a
domain independent core lexicon of a little over 300 entries (loaded in for any domain);
a domain independent lexicon of approximately 50 common terms for database lookup
applications (loaded in for any database application); and a domain specific lexicon of nearly
2000 lexical entries for ATIS. The DELPHI lexicons contain collocations as independent
lexical entries; these are processed by DELPHI as if they were entered directly as rules
in the grammar. Inflected forms do not have separate entries, but are either derived by
morphology in the case of regular lexical items, or are entered as part of the lexical entry
for the base form, for irregularly inflected lexical items.

4) NEW CONDITIONS FOR THIS EVALUATION: The DELPHI system for the February
1992 evaluation was different from the system used for the October 1991 dry run in several
ways. The use of labelled arguments in grammar rules, previously reported in our work on
mapping unit subcategorization, was extended to the grammar as a whole. The fragment
combiner was extended, and many changes were made to the discourse component and
task tracker.

5)  REFERENCES: 

Ayuso, D. “Discourse Entities in Janus”, Proceedings of ACL 27, Association for Computational
Linguistics, Murray Hill, NJ, pp. 243-250.

Bobrow, R. “Statistical Agenda Parsing”, Proceedings Speech and Natural Language Workshop
February 1991, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1991, pp. 222-224.

Bobrow, R., Ingria, R. and Stallard, D. “Syntactic and Semantic Knowledge in the DELPHI
Unification Grammar”, Proceedings Speech and Natural Language Workshop June 1990,
Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1990, pp. 230-236.

Bobrow, R., Ingria, R. and Stallard, D. “The Mapping Unit Approach to Subcategorization”,
Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California,
February 19-22, 1991, Morgan Kaufmann Publishers, Inc., San Mateo, California,
1991, pp. 185-189.


11.6.2 BYBLOS


1)  OVERALL  SYSTEM  DESCRIPTION: 

The Feb 92 BBN BYBLOS system used 45 spectral features to model speech: 14 cepstra
and their 1st and 2nd differences, plus normalized energy and its 1st and 2nd differences.
The HMM observation densities were modeled as tied Gaussian mixtures defined by K-means
clustering and fixed during training. Context-dependent phonetic HMMs (including
those occurring across word boundaries) were constructed from triphones, left-diphones,
right-diphones, and context-independent phonemes that were trained jointly in forward-backward.
Phoneme-dependent cooccurrence smoothing matrices were estimated from the
trained triphone context models and applied to all context-dependent observation densities.
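The count of 45 features follows from 14 cepstra with two orders of differences (14 x 3 = 42) plus energy with two orders of differences (3). The sketch below assembles such a vector per frame. It is illustrative only: the text does not say how the differences were computed, so the symmetric difference x[t+1] - x[t-1] with clamped edges used here is an assumption, as are the function names.

```python
def differences(frames):
    """Per-frame difference x[t+1] - x[t-1], clamping at the sequence edges.

    The exact difference formula is an assumption; the text does not give it.
    """
    last = len(frames) - 1
    return [
        [b - a for a, b in zip(frames[max(t - 1, 0)], frames[min(t + 1, last)])]
        for t in range(len(frames))
    ]


def build_features(cepstra, energy):
    """Stack cepstra, energy, and their 1st and 2nd differences per frame.

    cepstra: list of frames, each a list of 14 coefficients.
    energy:  list of (already normalized) energy values, one per frame.
    """
    e = [[x] for x in energy]
    d1c, d1e = differences(cepstra), differences(e)
    d2c, d2e = differences(d1c), differences(d1e)
    return [
        c + dc + ddc + en + de + dde
        for c, dc, ddc, en, de, dde in zip(cepstra, d1c, d2c, e, d1e, d2e)
    ]


# Five toy frames of 14 cepstral coefficients each.
cepstra = [[float(t + k) for k in range(14)] for t in range(5)]
energy = [0.1 * t for t in range(5)]
features = build_features(cepstra, energy)
print(len(features), len(features[0]))  # 5 45
```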

The BYBLOS recognizer used a time-synchronous, 3-pass algorithm that utilizes progressively
more detailed models in an efficient manner [1]. The 1st pass proceeds forward (in
time) through the data, saving ending times and scores for the top-scoring partial word-sequences.
The 2nd pass proceeds backward, computing the top N best word-sequences,
taking advantage of the word-sequence-ending scores from the first pass to prune aggressively
based on the score for the entire hypothesis under consideration. In the 3rd (forward)
pass, the N-best hypotheses are rescored using the system’s most detailed models,
including cross-word-boundary triphone HMMs and higher-order n-gram statistical language
models. The top-scoring answer after the rescoring pass constitutes the speech recognition
result.
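The final rescoring step amounts to reordering a short list of hypotheses under a more expensive scoring function. The sketch below shows only that step, with a toy score table standing in for the detailed acoustic and language models. The function name `rescore` and the scores are hypothetical, and the first two passes are not modeled.

```python
def rescore(nbest, detailed_log_score):
    """Reorder N-best hypotheses by a more detailed model's log score.

    nbest: list of hypotheses (strings here) from the earlier passes.
    detailed_log_score: callable returning a log score; higher is better.
    """
    return sorted(nbest, key=detailed_log_score, reverse=True)


# Toy log scores standing in for the detailed models (hypothetical values).
toy_scores = {
    "WHICH OF THOSE FLIGHTS ARE NONSTOP": -41.2,
    "WHICH OF THOSE FLIGHTS OUR NONSTOP": -47.9,
    "WHICH OF THESE FLIGHTS ARE NONSTOP": -43.5,
}
nbest = list(toy_scores)
ranked = rescore(nbest, toy_scores.get)
print(ranked[0])  # WHICH OF THOSE FLIGHTS ARE NONSTOP
```

Because only N hypotheses are rescored, the detailed models can be far more costly per hypothesis than anything used during the search itself.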

2)  ACOUSTIC  TRAINING: 

8000 spontaneously produced utterances were used for acoustic training. All were from
ATIS2 MADCOW data. The training was partitioned in order to train gender-dependent
models. 800 additional utterances were held out for development testing.

3)  GRAMMAR  TRAINING: 

A bigram class grammar was used for the N-best. The rescoring pass used a trigram
language model. The grammars included 1050 word-classes and were smoothed with
backoff probabilities. They were trained on the 14500 ATIS sentences in the ATIS0,
ATIS1, and ATIS2 subcorpora. The bigram perplexity was 14 on training and 18 on the
development test, while the trigram perplexity was 7 on the training, 13 on the development
test.
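Perplexity is conventionally the exponential of the average negative log probability the language model assigns to each word of a test text. The sketch below computes it from a list of per-word natural-log probabilities. How BYBLOS obtained those probabilities (class n-grams with backoff) is not reproduced here, and the function name is illustrative.

```python
import math


def perplexity(log_probs):
    """exp(-mean log probability), with natural-log per-word probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))


# If a model gave every word probability 1/14, its perplexity would be 14:
# on average the model is as uncertain as a uniform choice among 14 words.
uniform = [math.log(1 / 14)] * 10
print(round(perplexity(uniform), 6))  # 14.0
```

Read this way, a test-set perplexity of 13 for the trigram model versus 18 for the bigram model means the trigram model leaves the recognizer with fewer effective word choices at each point.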
4) RECOGNITION LEXICON DESCRIPTION:

The recognition dictionary had 1881 words. Only 7 words had multiple pronunciations.
15 different nonspeech events were treated as words in the dictionary.

5) NEW CONDITIONS FOR THIS EVALUATION:

This is the first time that a large quantity of spontaneous ATIS speech has been available
to train the HMM and language model parameters. We have used the trigram grammar
effectively on this task for the first time.

6)  REFERENCES: 

[1] “New Uses for N-Best Sentence Hypotheses Within the BYBLOS Speech Recognition
System”, ICASSP-92


11.6.3 HARC


(1) OVERALL SYSTEM DESCRIPTION: The SLS system used for the February 1992
evaluation used the BYBLOS system for speech recognition and the DELPHI system for
natural language processing (see individual descriptions of these systems for full details).
DELPHI processed the N-best lists produced by BYBLOS in the following manner: a
parameter S, indicating the maximum depth of search into the N-best list that DELPHI
would perform, was set. These S utterances were first processed by DELPHI with fragment
processing disabled until a database retrieval was made. If no database retrieval was made,
DELPHI iterated over the same S utterances, with fragment processing enabled. For this
evaluation, S was set at 5.
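The two-pass strategy described above can be written as a short loop. This is a sketch under assumptions rather than the actual HARC code: the `parse` callable and its signature are hypothetical stand-ins for DELPHI, returning a database retrieval or `None`.

```python
def understand(nbest, parse, depth=5):
    """Search the top `depth` hypotheses, first without fragments, then with.

    parse(utterance, fragments_enabled) is a hypothetical stand-in for DELPHI;
    it returns a database retrieval, or None if none could be made.
    """
    candidates = nbest[:depth]  # S in the text; S = 5 for the evaluation
    for fragments_enabled in (False, True):
        for utterance in candidates:
            retrieval = parse(utterance, fragments_enabled)
            if retrieval is not None:
                return utterance, retrieval
    return None


def toy_parse(utterance, fragments_enabled):
    # Toy behaviour: one hypothesis parses fully, another only via fragments.
    if utterance == "LIST ALL UNITED FLIGHTS":
        return "full-parse retrieval"
    if utterance == "LIST ALL UNIT FLIGHTS" and fragments_enabled:
        return "fragment retrieval"
    return None


hypotheses = ["LIST ALL UNIT FLIGHTS", "LIST ALL UNITED FLIGHTS"]
print(understand(hypotheses, toy_parse))
```

Note the ordering this implies: a lower-ranked hypothesis that parses completely is preferred over a higher-ranked one that can only be handled by the fragment processor.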

(2) TRAINING DATA: See individual descriptions of BYBLOS and DELPHI for the training
sets used for SPREC and NL separately. We developed our interface strategy on the
basis of combined SLS data from the October ’91 dry run and a subset of the NL training
data (approximately 700 sentences). We used a development test set of approximately 350
utterances comprising the intersection of the development test sets for SPREC and NL.

(3) NEW CONDITIONS FOR THIS EVALUATION: See individual descriptions of BYBLOS
and DELPHI.

(4) REFERENCES: See individual descriptions of BYBLOS and DELPHI.

of a Workshop Held at Pacific Grove, California, February 19-22, 1991, Morgan
Kaufmann Publishers, Inc., San Mateo, California, pp. 306-311.

Bobrow, Robert (1991) “Statistical Agenda Parsing”, in Speech and Natural Language:
Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991,
Morgan Kaufmann Publishers, Inc., San Mateo, California, pp. 222-224.

Bobrow, R. and Lance Ramshaw (1990) “On Deftly Introducing Procedural Elements into
Unification Parsing”, in Speech and Natural Language: Proceedings of a Workshop
Held at Hidden Valley, Pennsylvania, June 24-27, 1990, Morgan Kaufmann Publishers,
Inc., San Mateo, California, pp. 237-240.

Bobrow, R., Robert Ingria, and David Stallard (1990) “Syntactic and Semantic Knowledge
in the DELPHI Unification Grammar”, in Speech and Natural Language: Proceedings
of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, Morgan
Kaufmann Publishers, Inc., San Mateo, California, pp. 230-236.

Bobrow, Robert, Robert Ingria, and David Stallard (1991) “The Mapping Unit Approach
to Subcategorization”, in Speech and Natural Language: Proceedings of a Workshop
Held at Pacific Grove, California, February 19-22, 1991, Morgan Kaufmann
Publishers, Inc., San Mateo, California, pp. 185-189.

Boisen,  S.  (1989)  The  SLS  Personnel  Database  (Release  1),  SLS  Notes  No.  2,  BBN 
Systems  and  Technologies  Corporation,  1989. 

Boisen, Sean, Lance Ramshaw, Damaris Ayuso, and Madeleine Bates (1989) “A Proposal
for SLS Evaluation”, in Speech and Natural Language: Proceedings of a Workshop
Held at Cape Cod, Massachusetts, October 15-18, 1989, Morgan Kaufmann Publishers,
Inc., San Mateo, California, pp. 135-146.

Chow, Y-L. (1990) “Maximum Mutual Information Estimation of HMM Parameters for
Continuous Speech Recognition Using the N-Best Algorithms”, IEEE International
Conference on Acoustics, Speech, and Signal Processing in Albuquerque, NM, April
3-6, 1990.

Haas,  Andrew  (1989)  “A  Generalization  of  the  Offline-Parsable  Grammars”,  in  27th  Annual
Meeting  of  the  Association  for  Computational  Linguistics:  Proceedings  of  the
Conference,  Association  for  Computational  Linguistics,  Morristown,  NJ.  pp.

Harrison,  Philip,  Steven  Abney,  Ezra  Black,  Dan  Flickinger,  Claudia  Gdaniec,  Ralph
Grishman,  Donald  Hindle,  Robert  Ingria,  Mitch  Marcus,  Beatrice  Santorini,  Tomek
Strzalkowski  (1991)  “Evaluating  Syntax  Performance  of  Parser/Grammars  of
English”,  in  Jeannette  G.  Neal  and  Sharon  M.  Walter,  eds.,  Natural  Language  Processing
Systems  Evaluation  Workshop,  University  of  California,  Berkeley,  CA,  18  June
1991,  pp.  71-77.

Ingria,  Robert  J.  P.  (1989)  “Simulation  of  Language  Understanding:  Lexical  Recognition”, 
in  Computational  Linguistics:  An  International  Handbook  on  Computer  Oriented
Language  Research  and  Applications,  Walter  de  Gruyter  •  Berlin  •  New  York,  pp. 
336-347. 

Ingria,  Robert  J.  P.  (1990)  “The  Limits  of  Unification”,  in  28th  Annual  Meeting  of  the
Association  for  Computational  Linguistics:  Proceedings  of  the  Conference,  Association
for  Computational  Linguistics,  Morristown,  NJ.  pp.  194-204.

Ingria,  Robert  and  Lance  Ramshaw  (1989)  “Porting  to  New  Domains  Using  the  Learner”,
in  Speech  and  Natural  Language:  Proceedings  of  a  Workshop  Held  at  Cape  Cod,
Massachusetts,  October  15-18,  1989,  Morgan  Kaufmann,  Publishers,  Inc.,  San  Mateo,
California,  pp.  241-244.

Ingria,  Robert  and  David  Stallard  (1989)  “A  Computational  Mechanism  for  Pronominal
Reference”,  in  27th  Annual  Meeting  of  the  Association  for  Computational  Linguistics:
Proceedings  of  the  Conference,  Association  for  Computational  Linguistics,
Morristown,  NJ.  pp.  262-271.

Kubala,  F.,  S.  Austin,  C.  Barry,  J.  Makhoul,  P.  Placeway,  R.  Schwartz  (1991)  “BYBLOS
Speech  Recognition  Benchmark  Results”,  in  Speech  and  Natural  Language:
Proceedings  of  a  Workshop  Held  at  Pacific  Grove,  California,  February  19-22,  1991,
Morgan  Kaufmann  Publishers,  Inc.,  San  Mateo,  California,  pp.  77-82.

Kubala,  F.  and  R.  Schwartz  (1991)  “A  new  paradigm  for  speaker-independent  training”, 
IEEE  International  Conference  on  Acoustics,  Speech,  and  Signal  Processing  in  Toronto, 
Canada,  May  14-17,  1991. 

Ramshaw,  L.  (1989)  Manual  for  SLS  ERL  (Release  1),  SLS  Notes  No.  3,  BBN  Systems 
and  Technologies  Corporation,  1989. 

Schwartz,  Richard  and  Steve  Austin  (1990)  “Efficient,  High-Performance  Algorithms  for 
N-Best  Search”,  in  Speech  and  Natural  Language:  Proceedings  of  a  Workshop  Held 
at  Hidden  Valley,  Pennsylvania,  June  24-27,  1990,  Morgan  Kaufmann  Publishers,
Inc.,  San  Mateo,  California,  pp.  6-11. 

Schwartz,  R.  and  S.  Austin  (1991)  “A  comparison  of  several  approximate  algorithms  for 
finding  multiple  (N-best)  sentence  hypotheses”,  IEEE  International  Conference  on
Acoustics,  Speech,  and  Signal  Processing  in  Toronto,  Canada,  May  14-17, 1991. 

Schwartz,  R.,  and  Y-L.  Chow  (1990)  “The  N-Best  Algorithm:  An  Efficient  and  Exact
Procedure  for  Finding  the  N  Most  Likely  Sentence  Hypotheses”,  IEEE  International
Conference  on  Acoustics,  Speech,  and  Signal  Processing  in  Albuquerque,  NM,  April
3-6,  1990. 

Stallard,  David  (1989)  “Unification-Based  Semantic  Interpretation  in  the  BBN  Spoken  Language
System”,  in  Speech  and  Natural  Language:  Proceedings  of  a  Workshop  Held  at  Cape
Cod,  Massachusetts,  October  15-18,  1989,  Morgan  Kaufmann,  Publishers,  Inc.,  San
Mateo,  California,  pp.  39-46.


12.2  Abstracts  Accepted 

Bobrow,  R.,  Robert  Ingria,  and  David  Stallard  “Syntactic/Semantic  Coupling  in  the
DELPHI  System”,  Fifth  DARPA  Speech  and  Natural  Language  Workshop,  Arden  House,
February  23-26,  1992.

Bobrow,  R.,  and  David  Stallard  “Fragment  Processing  in  the  DELPHI  System”,  Fifth
DARPA  Speech  and  Natural  Language  Workshop,  Arden  House,  February  23-26,
1992.


12.3  Presentations 

Bates,  Madeleine  (1991)  talk  presented  at  the  DARPA  SLS  Mid-term  Meeting,  Carnegie 
Mellon  University,  October  21,  1991. 

Ingria,  Robert  J.  P.  (1989)  “Grammar  Evaluation  in  the  BBN  Spoken  Language  System”,
ALLC/ICCH  conference  on  “The  Dynamic  Text”,  Toronto,  June  6,  1989.

Ingria,  Robert  J.  P.  (1990a)  “Grammar  Development  and  Evaluation  in  the  BBN  Spoken
Language  System”,  talk  presented  at  the  University  of  Chicago  Center  for  Information
and  Language  Studies,  Chicago,  May  21,  1990.

Ingria,  Robert  J.  P.  (1990b)  “Grammar  Engineering  in  DELPHI”,  talk  presented  at  the
Grammar  Engineering  Workshop,  University  of  Saarbrücken,  June  22,  1990.

Ingria,  Robert  J.  P.  (1991a)  “Experiments  with  Unification  Grammar”,  talk  presented  at
Cambridge  University,  October  3,  1991  and  the  University  of  Tübingen,  October  11,
1991.

Ingria,  Robert  J.  P.  (1991b)  “BBN  ATIS  Data  Collection”,  talk  presented  at  the  DARPA
SLS  Mid-term  Meeting,  Carnegie  Mellon  University,  October  21,  1991.

Kubala,  Francis  (1991)  talk  presented  at  the  DARPA  SLS  Mid-term  Meeting,  Carnegie
Mellon  University,  October  21,  1991.


12.4  Meeting  and  Committee  Participation 


We  participated  in  numerous  committees  (*  denotes  committees  we  chaired),  including
MADCOW,  Principles  of  Interpretation,  *Common  Lexicon,  *Data  Formats,  Semantic
Evaluation,  and  Workshop  Planning.  We  attended  the  following  meetings  of  SLS  and
other  community-related  committees:


•  On  April  27-28,  1989,  we  attended  a  meeting  at  NIST  on  a  new  common  task  domain
and  on  Spoken  Language  evaluation  criteria.

•  On  August  23,  1989,  Bob  Ingria  attended  a  meeting  at  Bell  Labs  of  the  Steering
Committee  of  the  TreeBank  project,  whose  purpose  is  to  collect  a  large  annotated
corpus  of  text  for  the  community.  Issues  discussed  included  how  the  corpus  should
be  tagged,  and  how  those  tags  should  be  entered  while  ensuring  some  degree  of
consistency.

•  On  November  15-16,  1989,  we  attended  the  DARPA  planning  meeting  in  Washington.
The  main  topic  of  discussion  was  the  new  common  task  domain  and  performance
testing  procedures.

•  On  February  28  and  March  1,  1990,  Makhoul  and  Lyn  Bates  attended  the
DARPA  Coordinating  Committee  meeting  at  TI  in  Dallas,  Texas.  The  main  subject
was  data  collection  and  performance  evaluation  for  the  new  ATIS  domain.

•  On  July  26-27,  1990,  we  attended  the  SLS  Coordinating  Committee  meeting  held 
at  Dragon  Systems  in  Newton,  MA.  We  gave  a  presentation  on  our  progress  and 
participated  in  the  discussions. 

•  On  August  2,  1990,  Bob  Ingria  attended  a  meeting  at  Bell  Labs  of  the  Steering
Committee  of  the  TreeBank  project.  The  topic  of  the  meeting  was  the  best  methodology
for  adding  syntactic  bracketing  to  the  tagged  corpora,  so  as  to  provide  useful
syntactic  information  while  maintaining  consistency  and  a  high  level  of  throughput.

•  On  November  2-3,  1990,  Lyn  Bates  attended  and  participated  in  the  DARPA  SLS 
Coordinating  Committee  meeting  held  in  Palo  Alto. 

•  On  March  18,  1991,  we  attended  the  Corpus  Production  and  Evaluation  Committee
(CPAC)  held  at  MIT. 

•  On  March  19-20,  1991,  we  attended  the  Coordinating  Committee  Meeting  held  at 
MIT. 

•  On  July  29-30,  1991,  we  participated  in  the  DARPA  SLS  Coordinating  Committee
Meeting,  held  at  AT&T  Bell  Laboratories  in  Murray  Hill,  NJ. 


We  hosted  visits  from  government  personnel  and  other  members  of  the  DARPA  Speech 
and  Natural  Language  community  on  the  following  occasions: 


•  On  April  14,  1989,  Charles  Wayne  (DARPA)  and  David  Pallett  (NIST)  visited  BBN 
for  presentations  about  our  recent  Spoken  Language  Systems  work.  They  were  also 
given  demonstrations  of  and  participated  in  data  acquisition  by  our  Wizard  of  Oz
simulations  for  the  personnel  database  domain. 

•  On  November  9,  1989,  we  hosted  Cliff  Weinstein,  Victor  Zue,  and  Jim  Glass,  to
demonstrate  the  personnel  database  and  our  Wizard  of  Oz  system.

•  On  March  6,  1990,  Charles  Wayne  visited  BBN,  and  we  reviewed  progress  on
our  spoken  language  system  work.

•  On  August  1,  1990,  Charles  Wayne  of  DARPA  visited  BBN  to  review  our  spoken 
language  systems  work.  We  demonstrated,  for  the  first  time,  a  speaker-independent
version  of  BYBLOS  running  in  near  real-time  on  a  Sun  4  using  a  fully  connected
statistical  grammar.  We  also  made  several  technical  presentations  and  discussed 
contract  matters. 

•  On  November  7,  1990,  we  gave  a  demonstration  of  our  real-time  HARC  spoken
language  system  running  in  the  DART  domain  to  Tom  Crystal  of  DARPA/ISTO.


We  made  the  following  site  visits: 


•  On  August  3, 1989,  we  visited  the  Spoken  Language  Systems  group  at  MIT,  and  saw 
demonstrations  of  their  Voyager  application.  We  discussed  the  differences  between
our  data  collection  paradigms,  where  our  Wizard  of  Oz  approach  obtains  data  to
be  used  for  research  purposes  over  a  long  period  of  time,  while  their  frequent  data
collections  by  students  are  intended  for  short-term  use. 

•  On  September  6-8,  1989,  we  visited  Texas  Instruments  to  deliver  our  Wizard  of  Oz 
system  to  be  used  for  data  collection,  and  trained  them  in  its  use. 


We  participated  in  the  following  DARPA  Speech  and  Natural  Language  Workshops:


•  On  October  15-18,  1989,  we  attended  the  Second  DARPA  Speech  and  Natural  Language
Workshop  held  at  Cape  Cod,  and  presented  papers  on  unification-based  semantic
interpretation,  evaluation  of  database  query  systems,  developing  statistical  class
grammars  from  limited  data,  the  optimal  N-best  algorithm,  and  automatic  detection
of  new  words.

•  On  June  24-27,  1990,  we  attended  the  Third  DARPA  Speech  and  Natural  Language
Workshop  held  in  Hidden  Valley,  PA,  and  presented  papers  on  SLS  evaluation,
combining  syntax  and  semantics  in  a  unification  grammar,  efficient  unification  grammar
parsing,  the  N-Best  algorithm,  and  the  development  of  a  real-time  system.

•  On  February  19-22,  1991,  we  attended  the  Fourth  DARPA  Speech  and  Natural
Language  Workshop  held  at  Asilomar  and  presented  papers  on  our  mapping  units
approach  to  subcategorization,  our  best-first  statistical  parsing  algorithm,  a  proposal
for  incremental  dialogue  evaluation,  and  a  discussion  of  our  results  on  the  speech,
natural  language,  and  spoken  language  benchmark  evaluation.


As  part  of  our  participation  in  the  technical  life  of  the  natural  language,  speech  recognition,
and  spoken  language  systems  communities,  we  attended  the  following  conferences:


•  On  June  5-8, 1989,  we  attended  the  MUCK-II  workshop  on  Message  Understanding 
at  NOSC  as  observers.  The  workshop  was  useful  in  providing  information  about  what
the  state  of  the  art  is  in  message  processing.

•  On  June  26-29,  1989,  we  attended  the  27th  Annual  Meeting  of  the  Association
for  Computational  Linguistics  at  the  University  of  British  Columbia,  and  presented
papers  on  our  work  in  the  areas  of  syntax,  semantics,  and  discourse  processing. 

•  On  September  26-27,  1989,  we  attended  the  RADC  Natural  Language  Workshop,
where  we  made  presentations  on  knowledge  acquisition  and  Spoken  Language  System
evaluation.

•  On  November  29-December  1,  1989,  we  attended  the  Natural  Language  Symposium 
hosted  by  BBN,  where  we  made  a  presentation  on  alternatives  in  Spoken  Language 
Systems  research. 

•  On  June  6-9,  1990,  we  attended  the  28th  Annual  Meeting  of  the  Association  for
Computational  Linguistics  held  in  Pittsburgh,  and  presented  a  paper  on  the  limits  of 
unification. 

•  On  April  3-6,  1990,  we  attended  the  IEEE  International  Conference  on  Acoustics,
Speech,  and  Signal  Processing  in  Albuquerque,  NM  and  presented  papers  on  the
N-Best  algorithm  and  detection  of  out-of-vocabulary  words.

•  On  May  14-17,  1991,  we  attended  the  IEEE  International  Conference  on  Acoustics,
Speech,  and  Signal  Processing  in  Toronto,  Canada  and  presented  papers  on  the
N-best  algorithm,  the  forward-backward  search  algorithm,  adding  new  words  to  a  large
vocabulary  continuous  speech  recognition  system,  and  a  new  paradigm  for
speaker-independent  training.

•  On  June  19-21,  1991,  we  attended  the  29th  Annual  Meeting  of  the  Association  for
Computational  Linguistics  in  Berkeley,  California.  At  a  pre-conference  Workshop
on  Evaluation  of  Natural  Language  Processing  Systems,  we  presented  a  paper  on  a
developing  methodology  for  the  evaluation  of  spoken  language  systems.

•  On  October  21-22,  1991,  Bob  Ingria  participated  in  the  workshop  on  Open  Lexical 
and  Text  Resources  held  at  the  University  of  Pennsylvania. 

•  On  November  28-29,  1991,  Bob  Ingria  participated  in  the  Syntax  Evaluation  Workshop
held  at  the  University  of  Pennsylvania.  This  workshop  produced  the  first  workable
proposal  for  evaluating  the  syntactic  output  of  natural  language  parsing  systems.